Foreword


Signal models are at the heart of signal processing, be it for understanding, 
estimation, compression, or synthesis. While Fourier’s expansion of functions 
into a weighted sum of sinusoids is the earliest model for signals, D. Gabor’s lo- 
calized version of the Fourier transform is probably the earliest model for “real” 
signals. In his 1946 paper, Gabor introduces the idea of a time-frequency atom 
as an elementary object from which arbitrary signals can be constructed us- 
ing shifting and modulation. A variation on this theme, due to J. Morlet in 
1982, uses shifting and scaling and leads to wavelet transforms. Both Gabor’s 
and Morlet’s signal models are nonparametric — or generic in the sense of not 
assuming an underlying signal structure. In the case of speech and musical 
signals, parametric models have a long tradition. For example, the idea of a 
vocoder dates back to the 1930’s, and banks of tuned oscillators served as the 
building blocks for early music synthesizers. While such parametric models 
potentially allow very compact representations, they are not necessarily as universal
as nonparametric approaches. Furthermore, real-world signals do
not necessarily fit within the straitjackets we have been able to build so far! 


Michael Goodwin’s book, based on his PhD thesis at UC Berkeley, explores 
the boundary between parametric and generic models of signals, in particular 
for high quality audio representations. This is an exciting field, with many new 
applications having appeared in recent years. The understanding and the de- 
velopment of good models for digital audio is critical for high quality coders, for 
example, MPEG-2 audio coders. Such models are equally important in efficient 
analysis and synthesis of sound, as used in state-of-the-art music synthesizers. 
Many other facets of music processing also require sophisticated models, in- 
cluding manipulation (modification), understanding, and watermarking; these 
are just a few of the emerging tasks in audio processing. 


The book starts off with an excellent review of signal modeling, both in
the parametric and nonparametric cases. The point of view is that of
time-frequency representations. Next, in Chapter 2, sinusoidal modeling is
considered, and its variants are described in detail. This leads in Chapter 3 to
multiresolution approaches to sinusoidal modeling; this is an example of a
wavelet-based concept (namely multiresolution) applied to a classic sinusoid-based
algorithm, leading to original and improved results. Chapter 4 then considers
the key task of modeling the residual after sinusoids have been removed from 
the signal. This is a standard problem in analysis-by-synthesis methods, where 
the coherent components (the sinusoids) have been taken care of, but the resid- 
ual is still of importance for a natural rendition of the signal. Chapter 5 then 
deals with pitch-synchronous models, where the pseudo-periodicity of signals 
like speech is taken into account. Again, both sinusoidal and wavelet views are 
developed. Finally, Chapter 6 considers a more recent time-frequency method, 
namely matching pursuit, and presents algorithms tuned for audio applications. 

It is noteworthy that this book presents in a unified manner both review ma- 
terial and state-of-the-art research results. Thus, the material can be used both 
as an introduction to the area of time-frequency representations for audio and 
as a source of advanced results for research in the field. I believe Michael Good- 
win’s book will be helpful to students and researchers alike, and is therefore a 
welcome addition to the literature on digital audio processing. 


Martin Vetterli 
Professor of Communication Systems 
Swiss Federal Institute of Technology, Lausanne 
Adjunct Professor of Electrical Engineering 
and Computer Sciences 
University of California, Berkeley 


Preface 


This book is based on the electrical engineering doctoral thesis that I wrote at 
UC Berkeley during the summer of 1997. The process all started in 1993 when 
I began working as a graduate student researcher at the Center for New Music 
and Audio Technologies (CNMAT), a computer music lab affiliated with the 
university. My main project there involved modeling the noiselike components 
of musical sounds, for instance the breath noise of a flute. This residual mod- 
eling entailed dealing with the leftovers, or residual, after the primary musical 
features of a signal had been extracted by an analysis-synthesis process. 


After I had thought about the leftovers for a while, it was only natural to 
consider where the leftovers had come from. For the next few years, then, 
I worked on a variety of full signal models involving filter banks, sinusoids,
wavelets, time-frequency atoms, and so on. Autumns came and went, and it 
became time to write a dissertation. At that point, it seemed to me that the 
ideas I had worked on were mostly unrelated and that it would be somewhat 
difficult to organize my stack of notes into a cohesive document. When I wrote 
the outline, though, all of the pieces seemed to fit in the conceptual framework 
of adaptive signal modeling, and without too much nudging even! 


The end of this history is that there is now a book about the signal models 
I worked on and how they relate to other methods that have been considered 
in the literature. Assembling this material has been an exciting venture, and 
I hope that the text leaves the reader with a sense of that excitement.
Recently, an acquaintance suggested that the excitement has indeed faded from
signal processing since the field is mature and in some sense complete. I would 
contend that while we certainly have learned volumes about how to build and 
manipulate mathematical models of the natural world, there are countless av- 
enues as yet unexplored, and I for one am looking forward to finding out where 
they go. 
1 SIGNAL MODELS AND
ANALYSIS-SYNTHESIS 


Description is revelation. It is not
The thing described... 


— Wallace Stevens, “Description Without Place” 


The term signal modeling refers to the task of describing a signal with
respect to an underlying structure, that is, a model of the signal's fundamental
behavior. Analysis is the process of fitting such a model to a particular signal,
and synthesis is the process by which a signal is reconstructed using the model 
and the analysis data. This chapter discusses the basic theory and applications 
of signal models, especially those in which a signal is represented as a weighted 
sum of simple components; such models are the focus of this book. For the most 
part, the models to be considered are tailored for application to audio signals; 
in anticipation of this, examples related to audio are employed throughout the 
introduction to shed light on general modeling issues. 


1.1 ANALYSIS-SYNTHESIS SYSTEMS 


Signal modeling methods can be interpreted in the conceptual framework of 
analysis-synthesis. A general analysis-synthesis system for signal modeling is 
shown in Figure 1.1. The analysis block derives data pertaining to the signal 
model; this data is used by the synthesis block to construct a signal estimate. 
When the estimate is not perfect, the difference between the original $x[n]$ and
the reconstruction $\hat{x}[n]$ is nonzero; this difference signal $r[n] = x[n] - \hat{x}[n]$ is
termed the residual. The analysis-synthesis framework for signal modeling is 
developed further in the following sections. 
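The loop of Figure 1.1 can be sketched in a few lines. The example below is only an illustration under stated assumptions: the "model" is simply the set of largest-magnitude DFT coefficients, and the names `analyze`, `synthesize`, and the test signal are hypothetical rather than taken from the text.

```python
import cmath
import math

def analyze(x, num_peaks):
    """Analysis: keep the num_peaks largest-magnitude DFT coefficients."""
    N = len(x)
    spectrum = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
                for k in range(N)]
    # Indices of the strongest bins; every other bin is left out of the model.
    kept = sorted(range(N), key=lambda k: abs(spectrum[k]), reverse=True)[:num_peaks]
    return {k: spectrum[k] for k in kept}

def synthesize(model, N):
    """Synthesis: inverse DFT using only the retained coefficients."""
    return [sum(c * cmath.exp(2j * math.pi * k * n / N) for k, c in model.items()).real / N
            for n in range(N)]

N = 32
# Hypothetical test signal: a strong sinusoid plus a weaker one.
x = [math.cos(2 * math.pi * 3 * n / N) + 0.1 * math.cos(2 * math.pi * 7 * n / N)
     for n in range(N)]

model = analyze(x, num_peaks=2)                 # signal model data
x_hat = synthesize(model, N)                    # reconstruction
residual = [a - b for a, b in zip(x, x_hat)]    # r[n] = x[n] - x_hat[n]
```

With two retained bins the model captures only the strong sinusoid, so the residual is the weaker component; retaining four bins drives the residual to zero for this particular signal.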


1.1.1 Signal Representations 


A wide variety of models can be cast into the analysis-synthesis framework of 
Figure 1.1. Two specific cases that illustrate relevant issues will be considered 
here: filter banks and physical models. 

[Figure 1.1: block diagram with an analysis block and a synthesis block; labels: original signal $x[n]$, signal model data, reconstruction $\hat{x}[n]$, residual.]

Figure 1.1. An analysis-synthesis framework for signal modeling. The analysis block
derives the model data for the signal $x[n]$; the synthesis block constructs a signal estimate
$\hat{x}[n]$ based on the analysis data. If the reconstruction is not perfect, there is a nonzero
residual $r[n]$.


Filter banks. A common approach to signal modeling involves using anal- 
ysis and synthesis blocks consisting of filter banks. In such methods, the sig- 
nal model consists of the subband signals derived by the analysis bank plus 
a description of the synthesis filter bank; reconstruction is carried out by 
applying the subband signals to the synthesis filters and accumulating their 
respective outputs. This filter bank scenario has been extensively consid- 
ered in the literature. A few examples of filter bank techniques are short- 
time Fourier transforms [190], discrete wavelet transforms [238], discrete cosine 
transforms [195], lapped orthogonal transforms [141], and perceptual coding 
schemes wherein the filter bank is designed to mimic or exploit the properties 
of the human auditory or visual systems [26, 79, 92, 91, 109, 166, 223]. Such 
filter-based techniques have been widely applied in audio and image coding 
[26, 166, 178, 177, 205, 209, 210, 211, 212, 228] and a wide variety of designs
and structures for analysis-synthesis filter banks have been proposed [232, 238]. 
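As a minimal sketch of the filter bank case, the two-channel Haar bank below is written in block form: the analysis yields a lowpass and a highpass subband at half the rate, and the synthesis recombines them. The Haar filters and the sample values are an assumed example, chosen only because they are the shortest filters that reconstruct exactly.

```python
import math

def haar_analysis(x):
    """Split an even-length signal into lowpass and highpass subbands."""
    s = 1 / math.sqrt(2)
    low = [s * (x[2 * k] + x[2 * k + 1]) for k in range(len(x) // 2)]
    high = [s * (x[2 * k] - x[2 * k + 1]) for k in range(len(x) // 2)]
    return low, high

def haar_synthesis(low, high):
    """Recombine the two subbands into a full-rate signal."""
    s = 1 / math.sqrt(2)
    x_hat = []
    for l, h in zip(low, high):
        x_hat.append(s * (l + h))   # even-indexed sample
        x_hat.append(s * (l - h))   # odd-indexed sample
    return x_hat

x = [4.0, 4.0, 2.0, 0.0, -1.0, -1.0, 3.0, 5.0]
low, high = haar_analysis(x)        # subband signals (the model data)
x_hat = haar_synthesis(low, high)   # reconstruction
```

Here the subband signals plus the fixed synthesis filters constitute the whole representation, and the reconstruction matches the input up to rounding.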


Physical models. A significantly different situation arises in the case of 
physical modeling of musical instruments [217, 215], which is a generalization 
of the source-filter approaches that are commonly used in speech processing 
applications [78, 79, 97, 136, 157, 200]. In source-filter methods, the analysis 
consists of deriving a filter and choosing an appropriate source such that when 
the filter is driven by the source, the output is a reasonable estimate of the 
original signal; in some speech coding algorithms, the source mimics a glottal 
excitation while the filter models the shape of the vocal tract, meaning that the 
source-filter structure is designed to mirror the actual underlying physical sys- 
tem from which the speech signal originated. In physical modeling, this idea is 
extended to the case of arbitrary instruments, where both linear and nonlinear 
processing is essential to model the physical system [217]. Here, the purpose 
of the analysis is to derive a general physical description of the instrument in
question. That physical description, which constitutes the signal model data in 
this case, is used to construct a synthesis system that mimics the instrument’s 
behavior. In a guitar model, for instance, the model parameters derived by the 
analysis include the length, tension, and various wave propagation characteris- 
tics of the strings, the acoustic resonances of the guitar body, and the transfer 
properties of the string-body coupling. These physical parameters can be used
to build a system that, when driven by a modeled excitation such as a string 
pluck, synthesizes a realistic guitar sound [59, 112, 106, 217]. 
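The source-filter idea underlying these models can be illustrated with a toy synthesis stage. This is a sketch, not a guitar or vocal tract model: the periodic impulse train stands in for the excitation, a single two-pole resonator stands in for the filter, and all parameter values are assumptions chosen for illustration.

```python
import math

def impulse_train(num_samples, period):
    """Source: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

def two_pole_resonator(source, freq, radius):
    """Filter: y[n] = s[n] + 2 r cos(2 pi f) y[n-1] - r^2 y[n-2].

    freq is in cycles per sample; radius < 1 keeps the filter stable.
    """
    a1 = 2 * radius * math.cos(2 * math.pi * freq)
    a2 = -radius * radius
    y = []
    for n, s in enumerate(source):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(s + a1 * y1 + a2 * y2)
    return y

source = impulse_train(200, period=50)
output = two_pole_resonator(source, freq=0.1, radius=0.95)
```

In an analysis-synthesis setting, the analysis would estimate the period, the resonance frequency, and the pole radius from a recorded signal; here they are simply fixed by hand.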


Mathematical and physical models. In either of the above cases, the 
signal model and the analysis-synthesis process are inherently connected: in 
the filter bank case, the signal is modeled as an aggregation of subbands; in a 
physical model, the signal is interpreted as the output of a complex physical 
system. While these representations are significantly different, they share a 
common conceptual framework in that the synthesis is driven by data from 
the analysis, and in that both the analysis and synthesis are carried out in 
accordance with an underlying signal model.

In the literature, physical models and signal models are typically differenti- 
ated. The foundation for this distinction is that physical models are concerned 
with the systems that are responsible for generating the signal in question, 
whereas signal models, in the strictest sense, are purely concerned with a math- 
ematical approximation of the signal irrespective of its source — the signal is 
not estimated via an approximation of the generating physical system. As 
suggested in the previous sections, this differentiation is somewhat immaterial; 
both approaches provide a representation of a signal in terms of a model and 
corresponding data. Certainly, physical models rely on mathematical analy- 
sis; on the other hand, mathematical models are frequently based on physical 
considerations. While the models examined in this book are categorically math- 
ematical, in each case the representation is supported by underlying physical 
principles, e.g. pitch periodicity. 


Additive models. The general topic of this book is mathematical signal 
modeling; as stated above, the models are improved by physical insights. The 
designation of a model as mathematical is rather general, though. More specif- 
ically, the focus of this book is additive signal models of the form 


$$x[n] = \sum_{i=1}^{I} \alpha_i g_i[n], \tag{1.1}$$


wherein a signal is represented as a weighted sum of basic components; such 
models are referred to as decompositions or expansions. Of particular interest 
in these types of models is the capability of successive refinement. As will be 
seen, modeling algorithms can be designed such that the signal approximation 
is successively improved as the number of elements in the decomposition is 
increased; the improvement is measured using a metric such as mean-square 
error. This notion suggests another similarity between mathematical and physical
models; in either case, the signal estimate is improved by making the model
more complex — either by using a more complicated physical model or by using 
more terms in the expansion. In this light, the advantage of additive models 
is that the model enhancement is carried out by relatively simple mathematics 
rather than complicated physical analyses as in the physical modeling case. 
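Successive refinement is easy to observe numerically. The sketch below assumes an orthonormal length-4 Walsh-Hadamard basis and a made-up signal, adds the expansion terms in order of decreasing coefficient magnitude, and records the mean-square error after each term.

```python
# Orthonormal Walsh-Hadamard basis for length-4 signals (an assumed example basis).
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]
x = [3.0, 1.0, -1.0, 1.0]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

# For an orthonormal basis the expansion coefficients are plain inner products.
coeffs = [inner(x, g) for g in basis]
# Add the largest-magnitude terms first.
order = sorted(range(len(basis)), key=lambda i: abs(coeffs[i]), reverse=True)

approx = [0.0] * len(x)
mse_history = []
for i in order:
    approx = [a + coeffs[i] * g for a, g in zip(approx, basis[i])]
    mse_history.append(sum((p - q) ** 2 for p, q in zip(x, approx)) / len(x))

print(mse_history)  # [2.0, 1.0, 0.0, 0.0]
```

The error never increases as terms are added, and for this signal three of the four components already give an exact reconstruction.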

Signal models of the form given in Equation (1.1) are traditionally grouped 
into two categories: parametric and nonparametric. The fundamental distinction
is that in nonparametric methods, the components $g_i[n]$ are a fixed function
set, such as a basis; standard transform coders, for instance, belong to this
class. In parametric methods, on the other hand, the components are derived 
using parameters extracted from the signal. These issues will be discussed fur- 
ther throughout this book; for instance, it will be shown in Chapter 6 that the 
inherent signal adaptivity of parametric models can be achieved in models that 
are nonparametric according to this definition. In other words, for some types 
of models the distinction is basically moot. 

General additive models have been under consideration in the field of com- 
puter music since its inception [70, 155, 197, 198, 199]. The basic idea of such 
additive synthesis is that a complex sound can be constructed by accumulating 
a large number of simple sounds. This notion is essential to the task of model- 
ing musical signals; it is discussed further in the section on granular synthesis 
(Section 1.5.4) and is an underlying theme of this book. 


1.1.2 Perfect and Near-Perfect Reconstruction 


Filter banks satisfying perfect reconstruction constraints have received consid- 
erable attention in the literature [232, 238]. The term “perfect reconstruction” 
was coined to describe analysis-synthesis filter banks where the reconstruction 
is an exact duplicate of the original, with the possible exception of a time delay 
and a scale factor: 


$$\hat{x}[n] = A\,x[n - \delta]. \tag{1.2}$$


This notion, however, is by no means limited to the case of filter bank models; 
any model that meets the above requirement can be classified as a perfect 
reconstruction approach. Throughout, $A = 1$ and $\delta = 0$ will often be assumed
without loss of generality. 
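The perfect reconstruction condition can be checked directly once the gain and delay are known. The helper below is a hypothetical sketch (its name, arguments, and tolerance are assumptions): it compares the reconstruction against a scaled, delayed copy of the original over the samples where both are defined.

```python
def is_perfect_reconstruction(x, x_hat, gain=1.0, delay=0, tol=1e-9):
    """Return True if x_hat[n] equals gain * x[n - delay] within tol."""
    last = min(len(x_hat), len(x) + delay)
    return all(abs(x_hat[n] - gain * x[n - delay]) <= tol
               for n in range(delay, last))

x = [1.0, -2.0, 3.0, 0.5, 4.0]
# A reconstruction that is exact apart from a gain of 0.5 and a delay of 2 samples.
x_hat = [0.0, 0.0] + [0.5 * v for v in x]

print(is_perfect_reconstruction(x, x_hat, gain=0.5, delay=2))  # True
print(is_perfect_reconstruction(x, x_hat))                     # False
```

The second call fails only because the gain and delay are not compensated, which is why the two constants can be set to one and zero without loss of generality.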

In perfect reconstruction systems, provided that the gain and delay are com- 
pensated for, the residual signal indicated in Figure 1.1 is uniformly zero. In 
practice, however, perfect reconstruction is not generally achievable; in the 
filter bank case, for instance, subband quantization effects and channel noise 
interfere with the reconstruction process. Given these inherent difficulties with 
implementing perfect reconstruction systems, the design of near-perfect recon- 
struction systems has been considered for filter bank models as well as more 
general cases. In these approaches, the models are designed such that the 
reconstruction error has particular properties; for instance, filter banks for audio
coding are typically formulated with the intent of using auditory masking
principles to render the reconstruction error imperceptible [26, 79, 92, 91, 223].

As stated, signal models typically cannot achieve perfect reconstruction. 
This is particularly true in cases where the representation contains less data 
than the original signal, i.e. in cases where compression is achieved. Beyond 
those cases, some models, regardless of compression considerations, simply do 
not account for perfect reconstruction. In audiovisual applications, these situ- 
ations can be viewed in light of a looser near-perfect reconstruction criterion, 
that of perceptual losslessness or transparency, which is achieved in an analysis- 
synthesis system if the reconstructed signal is perceptually equivalent to the 
original. Note that a perceptually lossless system typically invokes psychophys- 
ical phenomena such as masking to effect data reduction or compression; its 
signal representation may be more efficient than that of a perfect reconstruction 
system. 


The notion of perceptual losslessness can be readily interpreted in terms of 
the analysis-synthesis structure of Figure 1.1. For one, a perfect reconstruction 
system is clearly lossless in this sense. In near-perfect models, however, to 
achieve perceptual losslessness it is necessary that either the analysis-synthesis 
residual contain only components that would be perceptually insignificant in 
the synthesis, or that the residual be modeled separately and reinjected into 
the reconstruction. The latter case is most general. 

As will be demonstrated in Chapter 2, the residual characteristically contains 
signal features that are not well-represented by the signal model, or in other 
words, components that the analysis is not designed to identify and that the 
synthesis is not capable of constructing. If these components are important 
(perceptually or otherwise) it is necessary to introduce a distinct model for 
the residual that can represent such features appropriately. Such signal-plus- 
residual models have been applied to many signal processing problems; this is 
considered further in Chapter 4. 


The signal models discussed in this book are generally near-perfect recon- 
struction approaches tailored for audio applications. For the sake of compres- 
sion or data reduction, perceptually unimportant information is removed from 
the representation. Thus, it is necessary to incorporate notions of perceptual 
relevance in the models. For music, it is well-known that high-quality synthesis 
requires accurate reproduction of note onsets or attacks [26, 208], which is in 
some sense analogous to the need for accurate reproduction of edges in image 
coding. This so-called attack problem will be addressed in each signal model; 
it provides a foundation for assessing the suitability of a model for musical 
signals. For approximate models of audio signals, the distortion of attacks, 
often described using the term pre-echo, leads to a visual cue for evaluating the 
models; comparative plots of original and reconstructed attacks are a reliable 
indicator of the relative auditory percepts. 

Issues similar to the attack problem commonly arise in signal processing ap- 
plications. In many analysis-synthesis scenarios, it is important to accurately 
model specific signal features; other features are relatively unimportant and 
need not be accurately represented. In other words, the reconstruction error
measure depends on the very nature of the signal and the applications of the 
representation. One example of this is compression of ambulatory electrocar- 
diogram signals for future off-line analysis; for this purpose it is only important 
to preserve a few key features of the heartbeat signal, and thus high compression 
rates can be achieved [128]. 


1.2 COMPACT REPRESENTATIONS 


Two very different models were discussed in Section 1.1.1, namely filter bank 
and physical models. These examples suggest the wide range of modeling tech- 
niques that exist; despite this variety, a few general observations can be made. 
Any given model is only useful inasmuch as it provides a signal description that 
is pertinent to the application at hand; in general, the usefulness of a model is 
difficult to assess without a priori knowledge of the signal. Given an accurate
model, a reasonable metric for further evaluation is the compaction of the
representation that the model provides. If a representation is both accurate and
compact, i.e. is not data intensive, then it can be concluded that the
representation captures the primary or meaningful signal behavior; a compact model
in some sense extracts the coherent structure of a signal [42, 139]. This insight 
suggests that accurate compact representations are applicable to the tasks of 
compression, denoising, analysis, and signal modification; these are discussed 
in turn. 


1.2.1 Compression 


It is perhaps obvious that by definition a compact representation is useful 
for compression. In terms of the additive signal model of Equation (1.1), a 
compact representation is one in which only a few of the model components 
$\alpha_i g_i[n]$ are significant. With regard to accurate waveform reconstruction, such
compaction is achieved when only a few coefficients $\alpha_i$ have significant values,
provided of course that the functions $g_i[n]$ all have the same norm. Then,
negligible components can be thresholded, i.e. set to zero, without substantially
degrading the signal reconstruction. In scenarios where perceptual criteria are 
relevant in determining the quality of the reconstruction, principles such as 
auditory masking can be invoked to achieve compaction; in some cases, masking 
phenomena can be used to justify neglecting components with relatively large 
coefficients. 

In expansions where the coefficients are all of similar value, thresholding 
is generally not useful and compaction cannot be readily achieved; this issue 
will come up again in Section 1.4 and Chapter 6. To derive signal models for 
coding, then, it is necessary to employ algorithms specifically designed to arrive 
at compact models. Various algorithms for computing signal expansions have 
focused on optimizing compaction metrics such as the entropy or $L_1$ norm of
the coefficients or the rate-distortion performance of the representation; these 
approaches allow for an exploration of the tradeoff between the amount of data 
in the representation and its accuracy in modeling the signal [31, 32, 37, 191].
Note that for the remainder of this book the terms compression and compaction 
will for the most part be used interchangeably. 
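To make the compaction metrics concrete, the sketch below compares two coefficient vectors of equal energy. The entropy definition used here, the Shannon entropy of the normalized coefficient energies, is one common choice and is an assumption; the cited works may define it differently.

```python
import math

def l1_norm(coeffs):
    """Sum of coefficient magnitudes."""
    return sum(abs(c) for c in coeffs)

def coefficient_entropy(coeffs):
    """Shannon entropy in bits of the normalized coefficient energies."""
    energy = sum(c * c for c in coeffs)
    probs = [c * c / energy for c in coeffs if c != 0]
    return -sum(p * math.log2(p) for p in probs)

compact = [2.0, 0.0, 0.0, 0.0]   # all energy in one component
spread = [1.0, 1.0, 1.0, 1.0]    # same energy, spread evenly

print(l1_norm(compact), l1_norm(spread))   # 2.0 4.0
print(coefficient_entropy(spread))         # 2.0
```

Both metrics are smaller for the compact vector (its entropy is zero), even though the two vectors have the same energy; minimizing either one therefore favors expansions with few significant terms.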


1.2.2 Denoising


It has been argued that compression and denoising are linked [159]. This argu- 
ment is based on the observation that white noise is essentially incompressible; 
for instance, an orthogonal transform of white noise is again white, i.e. there
is no compaction in the transform data and thus no compression is achievable. 
In cases where a coherent signal is degraded by additive white noise, the noise 
in the signal is not compressible. Then, a compressed representation does not 
capture the noise; it extracts the primary structure of the signal and a re- 
construction based on such a compact model is in some sense a denoised or 
enhanced version of the original. In cases where the signal is well-modeled as a 
white noise process and the degradations are coherent, e.g. digital data with a 
sinusoidal jammer, this argument does not readily apply. 

In addition to the filter-based considerations of [159], the connection between 
compression and denoising has been explored in the Fourier domain [25] and 
in the wavelet domain [51]. In these approaches, the statistical assumption is 
that small expansion coefficients correspond to noise instead of important signal 
features; as a result, thresholding the coefficients results in denoising. There 
are various results in the literature for thresholding wavelet-based representa- 
tions [51]; such approaches have been applied with some success to denoising 
old sound recordings [19, 20]. Furthermore, motivated by the observation that 
quantization is similar to a thresholding operation, there have been recent con- 
siderations of quantization as a denoising approach [30]. 

It is interesting to note that denoising via thresholding has an early corre- 
spondence in time-domain speech processing for dereverberation and removing 
background noise [21, 152]. In that method, referred to as center clipping, a 
signal is set to zero if it is below a threshold; if it is above the threshold, the 
threshold is subtracted. For a threshold a, the center-clipped signal is 


$$\hat{x}[n] = \begin{cases} x[n] - a, & x[n] > a \\ 0, & x[n] \le a, \end{cases} \tag{1.3}$$


which corresponds to soft-thresholding the signal in the time domain rather 
than in a transform domain as in the methods discussed above.¹ This approach
was considered effective for removing long-scale reverberation, i.e. echoes that 
linger after the signal is no longer present; such reverberation decreases the 
intelligibility of speech; also, center clipping is useful as a front end for pitch 
detection of speech and audio signals [190, 219]. 
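Center clipping is a one-line operation. The sketch below gives two variants: `center_clip` follows the wording of the text literally (values above the threshold have it subtracted, everything else is zeroed), while `soft_threshold` is the symmetric form that also shrinks negative excursions. The symmetric form and both function names are assumptions added for comparison.

```python
def center_clip(x, threshold):
    """One-sided rule: subtract the threshold above it, zero otherwise."""
    return [v - threshold if v > threshold else 0.0 for v in x]

def soft_threshold(x, threshold):
    """Symmetric variant: shrink magnitudes toward zero by the threshold."""
    return [(abs(v) - threshold) * (1 if v > 0 else -1) if abs(v) > threshold else 0.0
            for v in x]

x = [0.25, 1.5, -0.5, -2.0, 0.75]

print(center_clip(x, 0.5))      # [0.0, 1.0, 0.0, 0.0, 0.25]
print(soft_threshold(x, 0.5))   # [0.0, 1.0, 0.0, -1.5, 0.25]
```

The two outputs differ only at the large negative sample, which the one-sided rule discards and the symmetric rule retains with reduced magnitude.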


¹ This kind of thresholding nonlinearity does not necessarily yield objectionable perceptual
artifacts in speech signals; a similar nonlinearity has been used to improve the performance 
of stereo echo cancellation without degrading the speech quality [17]. 
1.2.3 Analysis, Detection, and Estimation


In an accurate compact representation, the primary structures of the signal are 
well-modeled. Given the representation, then, it is possible to determine the 
basic behavior of the signal. Certain patterns of behavior, if present in the 
signal, can be clearly identified in the representation, and specific parameters 
relating to that behavior can be extracted from the model. In this light, a 
compact representation enables signal analysis and characterization as well as 
the related tasks of detection, identification, and estimation. 


1.2.4 Modification 


In audio applications, it is often desirable to carry out modifications such as 
time-scaling, pitch-shifting, and cross-synthesis. Time-scaling refers to altering 
the duration of a sound without changing its pitch; pitch-shifting, inversely, 
refers to modifying the perceived pitch of a sound without changing its duration.
Finally, cross-synthesis is the process by which two sounds are merged
in a meaningful way; an example of this is applying a guitar string excitation 
to a vocal tract filter, resulting in a “talking” guitar [67]. These modifications 
cannot be carried out flexibly and effectively using commercially available sys- 
tems such as samplers or frequency-modulation (FM) synthesizers [34]. For this 
reason, it is of interest to explore the possibility of carrying out modifications 
based on additive signal models. 
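A small sketch may help show why an additive model lends itself to these modifications. It assumes the analysis has already reduced a stationary tone to a list of (amplitude, frequency) pairs; the values and the function name `synthesize` are hypothetical, and real signals would need time-varying parameters.

```python
import math

def synthesize(partials, num_samples, sample_rate):
    """Additive resynthesis from (amplitude, frequency_hz) pairs."""
    return [sum(a * math.cos(2 * math.pi * f * n / sample_rate) for a, f in partials)
            for n in range(num_samples)]

sample_rate = 8000
partials = [(1.0, 440.0), (0.5, 880.0)]   # hypothetical analysis data

original = synthesize(partials, 800, sample_rate)
# Time-scaling: twice the duration, same frequencies (same pitch).
stretched = synthesize(partials, 1600, sample_rate)
# Pitch-shifting: frequencies doubled (one octave up), same duration.
shifted = synthesize([(a, 2 * f) for a, f in partials], 800, sample_rate)
```

Because duration and frequency are separate parameters of the model, each can be changed without touching the other, which is not possible when operating directly on the stored samples.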

A signal model is only useful with regard to musical modifications if it identi- 
fies musically relevant features of the signal such as pitch and harmonic struc- 
ture; thus, a certain amount of analysis is a prerequisite to modification ca- 
pabilities. Furthermore, data reduction is of significant interest for efficient 
implementations. Such compression can be achieved via the framework of per- 
ceptual losslessness; the signal model can be simplified by exploiting the prin- 
ciples of auditory perception and masking. This simplification, however, can 
only be carried out if the model components can individually be interpreted 
in terms of perceptually relevant parameters. If the components are perceptu- 
ally motivated, their structure can be modified in perceptually predictable and 
meaningful ways. Thus, a compact transparent representation in some sense 
has inherent modification capabilities. Given this interrelation of data reduc- 
tion, signal analysis, and perceptual considerations, it can be concluded from 
the preceding discussions that the modification capabilities of a representation 
hinge on its compactness. Also, for music synthesis with real-time modifica- 
tions, compact models are useful since they reduce the number of controls that 
must be manipulated by the musician. 


1.3 PARAMETRIC METHODS 


As discussed in Section 1.1, signal models have been traditionally categorized 
as parametric or nonparametric. In nonparametric methods, the model is con- 
structed using a rigid set of functions whereas in parametric methods the com- 
ponents are based on parameters derived by analyzing the signal. Examples 
of parametric methods include source-filter and physical models [200, 217],
linear predictive and prototype waveform speech coding [116, 136], granular 
analysis-synthesis of music [198], structured audio representations [234], and 
the sinusoidal model [149, 208]. The sinusoidal model is discussed at length in 
Chapter 2; granular synthesis is described in Section 1.5.4. The other models 
are discussed to varying extents throughout this text. 

The distinction between parametric and nonparametric methods is admit- 
tedly vague. For instance, the indices of the expansion functions in a nonpara- 
metric approach can be thought of as parameters, so the terminology is clearly 
somewhat inappropriate. The issue at hand is clarified in the next section, in 
which various nonparametric methods are reviewed, as well as in Chapter 2 
in the treatment of the short-time Fourier transform and the sinusoidal model, 
where a nonparametric filter bank method is revamped into a parametric model 
to facilitate signal modifications and reliable synthesis. The latter discussion 
indicates that the real issue is one of signal adaptivity rather than parametrization,
i.e. a description of a signal is most useful if the associated parameters are
signal-adaptive. It should be noted that traditional signal-adaptive parametric 
representations are not generally capable of perfect reconstruction; this notion 
is revisited in Chapter 6, which presents signal-adaptive parametric models 
that can achieve perfect reconstruction in some cases. As will be discussed, 
such methods illustrate that the distinction between parametric and nonpara- 
metric is basically insubstantial. 


1.4 NONPARAMETRIC METHODS 


In contrast to parametric methods, nonparametric methods for signal expansion 
involve expansion functions that are in some sense rigid; they cannot necessarily 
be represented by physically meaningful parameters. Arbitrary basis expansions 
and overcomplete expansions belong to the class of nonparametric methods. 
The expansion functions in these cases are simply sets of vectors that span 
the signal space; they do not necessarily have an underlying structure. Note 
that these nonparametric expansions are tightly linked to the methods of linear 
algebra; the following discussion thus relies on matrix formulations. 


1.4.1 Basis Expansions 


For a vector space $V$ of dimension $N$, a basis is a set of $N$ linearly independent
vectors $\{b_1, b_2, \ldots, b_N\}$. Linear independence implies that there is no nonzero
solution $\{\gamma_n\}$ to the equation

$$\sum_{n=1}^{N} \gamma_n b_n = 0. \qquad (1.4)$$


Then, the matrix 


$$B = [\, b_1 \;\; b_2 \;\; \cdots \;\; b_N \,], \qquad (1.5)$$




whose columns are the basis vectors $\{b_n\}$, is invertible. Given the linear
independence property, it follows that any vector $x \in V$ can be expressed as a
unique linear combination of the form

$$x = \sum_{n=1}^{N} \alpha_n b_n. \qquad (1.6)$$

In matrix notation, this can be written as

$$x = B\alpha, \qquad (1.7)$$

where $\alpha = [\alpha_1 \; \alpha_2 \; \alpha_3 \; \ldots \; \alpha_N]^T$. The coefficients of the expansion are given by

$$\alpha = B^{-1} x. \qquad (1.8)$$


Computation of a basis expansion can also be phrased without reference to the 
matrix inverse $B^{-1}$; this approach is provided by the framework of biorthogonal
bases, in which the expansion coefficients are evaluated by inner products with 
a second basis. After that discussion, the specific case of orthogonal bases is 
examined and some familiar examples from signal processing are considered. 
It should be noted that the discussion of basis expansions in this section does 
not rely on the norms of the basis vectors, but that no generality would be lost 
by restricting the basis vectors to having unit norm. In later considerations, it 
will indeed prove important that all the expansion functions have unit norm. 


Biorthogonal bases. Two bases $\{a_1, a_2, \ldots, a_N\}$ and $\{b_1, b_2, \ldots, b_N\}$ are
said to be a pair of biorthogonal bases if

$$A^H B = I, \qquad (1.9)$$

where $H$ denotes the conjugate transpose, $I$ is the $N \times N$ identity matrix and
the matrices $A$ and $B$ are given by

$$A = [\, a_1 \;\; a_2 \;\; \cdots \;\; a_N \,] \quad \text{and} \quad B = [\, b_1 \;\; b_2 \;\; \cdots \;\; b_N \,]. \qquad (1.10)$$


Equation (1.9) can be equivalently expressed in terms of the basis vectors as 
the requirement that 


$$\langle a_i, b_j \rangle = a_i^H b_j = \delta[i - j]. \qquad (1.11)$$


Such biorthogonal bases are also referred to as dual bases. 
Given the relationship in Equation (1.9), it is clear that 


$$A^H = B^{-1}. \qquad (1.12)$$


Then, because the left inverse and right inverse of an invertible square matrix 
are the same [220], the biorthogonality constraint corresponds to 


$$A B^H = I \quad \text{and} \quad B A^H = I. \qquad (1.13)$$
This yields a pair of simple expressions for expanding a signal $x$ with respect
to the biorthogonal bases:

$$x = A B^H x = B A^H x = \sum_{n=1}^{N} \langle b_n, x \rangle a_n = \sum_{n=1}^{N} \langle a_n, x \rangle b_n. \qquad (1.14)$$


This framework of biorthogonality leads to flexibility in the design of wavelet 
filter banks [238]. Furthermore, biorthogonality allows for independent evalua- 
tion of the expansion coefficients, which leads to fast algorithms for computing 
signal expansions. 
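
The dual-basis computation of Equations (1.9) through (1.14) can be illustrated numerically. The following is a minimal sketch, not a method taken from the text: the 3 x 3 matrix `B` and the test vector `x` are arbitrary values chosen for illustration, and the dual basis is obtained directly from $A^H = B^{-1}$.

```python
import numpy as np

# Hypothetical non-orthogonal basis for R^3; the columns of B are the basis vectors.
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# The dual basis A satisfies A^H B = I, i.e. A^H = B^{-1} (Equation (1.12)).
A = np.linalg.inv(B).conj().T

x = np.array([2.0, -1.0, 3.0])  # arbitrary test signal

# Each coefficient is an independent inner product <a_n, x> = a_n^H x.
alpha = A.conj().T @ x

# Synthesis with the primal basis: x = sum_n <a_n, x> b_n.
x_rec = B @ alpha

# The roles of the two bases can be exchanged: x = sum_n <b_n, x> a_n.
x_rec_swapped = A @ (B.conj().T @ x)

print(alpha, x_rec, x_rec_swapped)
```

Both reconstructions recover `x`, which reflects the symmetry of the two sums in Equation (1.14).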


Orthogonal bases. An orthogonal basis is a special case of a biorthogonal 
basis in which the two biorthogonal or dual bases are identical; here, the or- 
thogonality constraint is 


$$\langle b_i, b_j \rangle = \delta[i - j], \qquad (1.15)$$

which can be expressed in matrix form as

$$B^H B = I \;\Longrightarrow\; B^H = B^{-1}. \qquad (1.16)$$


Strictly speaking, such bases are referred to as orthonormal bases [220]; how- 
ever, since most applications involve unit-norm basis functions, there has been a 
growing tendency in the literature to use the terms orthogonal and orthonormal 
interchangeably [238]. 

For an expansion in an orthogonal basis, the coefficients for a signal x are 
given by 


$$\alpha = B^H x \;\Longrightarrow\; \alpha_n = \langle b_n, x \rangle, \qquad (1.17)$$


so the expansion can be written as 


$$x = \sum_{n=1}^{N} \langle b_n, x \rangle b_n. \qquad (1.18)$$


As in the general biorthogonal case, the expansion coefficients can be indepen- 
dently evaluated. 
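
As a concrete instance of Equations (1.16) through (1.18), the sketch below builds an orthonormal basis of complex sinusoids and expands a signal in it. The unit-norm scaling $1/\sqrt{N}$, the length `N = 8`, and the random test signal are choices made here for illustration; with this scaling the inner-product coefficients coincide with NumPy's FFT divided by $\sqrt{N}$.

```python
import numpy as np

N = 8
n = np.arange(N)

# Columns are unit-norm complex sinusoids: B[n, k] = exp(j 2 pi k n / N) / sqrt(N).
B = np.exp(2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

rng = np.random.default_rng(0)
x = rng.standard_normal(N)  # arbitrary test signal

# Analysis: alpha_n = <b_n, x> = b_n^H x, evaluated for all n at once.
alpha = B.conj().T @ x

# Synthesis: x = sum_n <b_n, x> b_n.
x_rec = B @ alpha

# For comparison, the same coefficients from the FFT with matching scaling.
fft_coeffs = np.fft.fft(x) / np.sqrt(N)

print(np.allclose(x_rec, x), np.allclose(alpha, fft_coeffs))
```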


Examples of basis expansions. The following list summarily describes the 
wide variety of basis expansions that have been considered in the signal pro- 
cessing literature; supplementary details are supplied throughout the course of 
this book when needed: 


The discrete Fourier transform (DFT) involves representing a signal in terms
of sinusoids. For a discrete-time signal of length N, the expansion functions
are sinusoids of length N. Since the expansion functions do not have compact
time support, i.e. none of the basis functions are time-localized, this
representation is ineffective for modeling events with short duration. Local- 
ization can in some sense be achieved for the case of a purely periodic signal 
whose length is an integral multiple of the period M, for which a DFT of 
size M provides an exact representation. 


The short-time Fourier transform (STFT) is a modification of the DFT 
that has improved time resolution; it allows for time-localized representation 
of transient events and similarly enables DFT-based modeling of signals 
that are not periodic. The STFT is carried out by segmenting the signal 
into frames and carrying out a separate DFT for each short-duration frame. 
The expansion functions in this case are sinusoids that are time-limited to 
the signal frame, so the representation of dynamic signal behavior is more 
localized than in the general Fourier case. This is examined in greater detail 
in Chapter 2 in the treatment of the phase vocoder and the progression of 
ideas leading to the sinusoidal model. 


Block transforms. This is a general name for approaches in which a signal 
is segmented into blocks of length N and each segment is then decomposed 
in an N-dimensional basis. To achieve compression, the decompositions are 
quantized and thresholded, which leads to discontinuities in the reconstruc- 
tion, e.g. blockiness in images and frame-rate distortion artifacts in audio. 
This issue is somewhat resolved by lapped orthogonal transforms, in which 
the support of the basis functions extends beyond the block boundaries, 
which allows for a higher degree of smoothness in approximate reconstructions
[141, 142].


Critically sampled perfect reconstruction filter banks compute expansions of 
signals with respect to a biorthogonal basis related to the impulse responses 
of the analysis and synthesis filters [238]. This idea is fundamental to recent 
signal processing developments such as wavelets and wavelet packets. 


Wavelet packets correspond to iterations of two-channel filter banks [238]; 
such iterated filter banks are motivated by the observation that a perfect 
reconstruction model can be applied to the subband signals in a critically 
sampled perfect reconstruction filter bank without marring the reconstruc- 
tion. This leads to perfect reconstruction tree-structured filter banks and 
multiresolution capabilities as will be discussed in Section 1.5.1. Such trees 
can be made adaptive so that the filter bank configuration changes in time 
to adapt to changes in the input signal [102]; in such cases, however, the 
resulting model is no longer simply a basis expansion. This is discussed 
further in Section 1.4.2 and Chapter 3. 


The discrete wavelet transform is a special case of a wavelet packet where the 
two filters are generally highpass and lowpass and the iteration is carried out 
successively on the lowpass branch. This results in an octave-band filter bank 
in which the sampling rate of a subband is proportional to its bandwidth.
The resulting signal model is the wavelet decomposition, which consists
of octave-band signal details plus a lowpass signal estimate given by the 
lowpass filter of the final iterated filter bank. This model generally provides 
significant compaction for images but not as much for audio [47, 205, 209, 
211, 212]. As will be seen in Chapter 5, in audio applications it is necessary 
to incorporate adaptivity in wavelet-based models to achieve transparent 
compaction [212]. 


Shortcomings of basis expansions. Basis expansions have a serious draw- 
back in that a given basis is not well-suited for decomposing a wide variety of 
signals. For any particular basis, it is trivial to provide examples for which the 
signal expansion is not compact; the uniqueness property of basis representa- 
tions implies that a signal with a noncompact expansion can be constructed 
by simply linearly combining the N basis vectors with N weights that are of 
comparable magnitude. 


Consider the well-known cases depicted in Figure 1.2. For the frequency- 
localized signal of Figure 1.2(a), the Fourier expansion shown in Figure 1.2(c) 
is appropriately sparse and indicates the important signal features; in contrast, 
an octave-band wavelet decomposition (Figure 1.2(e)) provides a poor rep- 
resentation because it is fundamentally unable to resolve multiple sinusoidal 
components in a single subband. For the time-localized signal of Figure 1.2(b), 
on the other hand, the Fourier representation of Figure 1.2(d) does not readily 
yield information about the basic signal structure; it cannot provide a com- 
pact model of a time-localized signal since none of the Fourier expansion func- 
tions are themselves time-localized. In this case, the wavelet transform (Figure 
1.2(f)) yields a more effective signal model. 
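
The Fourier half of this comparison is easy to reproduce. The sketch below is an illustration under assumed parameters (length 64, a sinusoid with exactly five cycles, a single impulse) and does not reproduce the signals of Figure 1.2; it counts the significant coefficients of each signal in the time domain and in an orthonormal DFT. The wavelet half of the comparison is not shown.

```python
import numpy as np

N = 64
n = np.arange(N)

def count_significant(coeffs, rel_tol=1e-8):
    """Count coefficients whose magnitude exceeds rel_tol times the largest one."""
    mags = np.abs(coeffs)
    return int(np.sum(mags > rel_tol * mags.max()))

# Frequency-localized signal: exactly 5 cycles over the N samples.
sinusoid = np.cos(2 * np.pi * 5 * n / N)

# Time-localized signal: a single impulse.
impulse = np.zeros(N)
impulse[10] = 1.0

dft_sinusoid = np.fft.fft(sinusoid, norm="ortho")
dft_impulse = np.fft.fft(impulse, norm="ortho")

print("sinusoid: DFT terms =", count_significant(dft_sinusoid))
print("impulse:  time terms =", count_significant(impulse),
      " DFT terms =", count_significant(dft_impulse))
```

The sinusoid needs two DFT coefficients (a conjugate pair), while the impulse, a single sample in time, spreads evenly over all 64 DFT coefficients.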


The shortcomings of basis expansions result from the attempt to represent 
arbitrary signals in terms of a very limited set of functions. Better represen- 
tations can be derived by using expansion functions that are signal-adaptive; 
signal adaptivity can be achieved via parametric approaches such as the sinu- 
soidal model [89, 149, 208], by using adaptive wavelet packets or best basis 
methods [37, 102, 191], or by choosing the expansion functions from an over- 
complete set of time-frequency functions or atoms [139]. Fundamentally, each 
of these models is an expansion based on an overcomplete set of vectors; this 
section focuses on the latter two, however, since these belong to the class of 
nonparametric methods. The term overcomplete means that the set or dictio- 
nary spans the signal space but includes more functions than is necessary to 
do so. Using a highly overcomplete dictionary of time-frequency atoms enables 
compact representation of a wide range of time-frequency behaviors; this de- 
pends however on choosing atoms from the dictionary that are appropriate for 
decomposing a given signal, i.e. the atoms are chosen in a signal-adaptive way.
Basis expansions do not exhibit such signal adaptivity and as a result do not 
provide compact representations for arbitrary signals. According to the discus- 
sion in Section 1.2, this implies that basis expansions are not generally useful 
Figure 1.2. Shortcomings of basis expansions. The frequency-localized signal in (a) has
a compact Fourier transform (c) and a noncompact wavelet decomposition (e); the time- 
localized signal in (b) has a noncompact Fourier expansion (d) and a compact wavelet 
representation (f). 


for signal analysis, compression, denoising, or modification. The shortcomings 
provide a motivation for considering overcomplete expansions. 


1.4.2. Overcomplete Expansions 


For a vector space $V$ of dimension $N$, a complete set is a set of $M$ vectors
$\{d_1, d_2, \ldots, d_M\}$ that contains a basis ($M \geq N$). The set is furthermore
referred to as overcomplete or redundant if in addition to a basis it also contains
other distinct vectors ($M > N$). As will be seen, such redundancy leads to signal
adaptivity and compact representations; algebraically, it implies that there
are nonzero solutions $\{\gamma_m\}$ to the equation

$$\sum_{m=1}^{M} \gamma_m d_m = 0. \qquad (1.19)$$


There are thus an infinite number of possible expansions of the form 


$$x = \sum_{m=1}^{M} \alpha_m d_m. \qquad (1.20)$$

Namely, if $\{\alpha_m\}$ is a solution to the above equation and $\{\gamma_m\}$ is a solution to
Equation (1.19), then $\{\alpha_m + \gamma_m\}$ is also a solution:

$$x = \sum_{m=1}^{M} (\alpha_m + \gamma_m) d_m = \sum_{m=1}^{M} \alpha_m d_m + \sum_{m=1}^{M} \gamma_m d_m = \sum_{m=1}^{M} \alpha_m d_m. \qquad (1.21)$$

In matrix notation, with

$$D = [\, d_1 \;\; d_2 \;\; \cdots \;\; d_M \,], \qquad (1.22)$$

Equation (1.20) can be written as

$$x = D\alpha, \qquad (1.23)$$

where $\alpha = [\alpha_1 \; \alpha_2 \; \alpha_3 \; \ldots \; \alpha_M]^T$; the multiplicity of solutions can be interpreted
in terms of the null space of $D$, which has nonzero dimension:

$$x = D(\alpha + \gamma) = D\alpha + D\gamma = D\alpha. \qquad (1.24)$$


Since there are many possible overcomplete expansions, there are likewise a va- 
riety of metrics and methods for computing the expansions. The overcomplete 
case thus lacks the structure of the basis case, where the coefficients of the 
expansion can be derived using an inverse matrix computation or, equivalently, 
correlations with a biorthogonal basis. As a result, the signal modeling advan- 
tages of overcomplete expansions come at the cost of additional computation. 
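
The non-uniqueness described above can be checked on a toy case. In the following sketch the dictionary `D` (three vectors in a two-dimensional space), the signal `x`, and the coefficient vectors are hypothetical values chosen so that the null-space vector can be verified by inspection.

```python
import numpy as np

# Hypothetical overcomplete dictionary: M = 3 vectors in R^2 (columns of D).
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

x = np.array([2.0, 3.0])

alpha = np.array([2.0, 3.0, 0.0])   # one expansion: x = 2 d_1 + 3 d_2
gamma = np.array([1.0, 1.0, -1.0])  # d_1 + d_2 - d_3 = 0, so D @ gamma = 0

# Adding any multiple of gamma changes the coefficients but not the synthesis.
for c in (0.0, 1.0, -2.0):
    coeffs = alpha + c * gamma
    print(c, coeffs, D @ coeffs)
```

Every line of output shows a different coefficient vector that synthesizes the same signal, as in Equation (1.24).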


Derivation of overcomplete expansions. In the general basis case, the
coefficients of the expansion are given by $\alpha = B^{-1} x$. For overcomplete expansions,
one solution to Equation (1.20) can be found by using the singular value
decomposition (SVD) of the dictionary matrix $D$ to derive its pseudo-inverse
$D^+$. The coefficients $\alpha = D^+ x$ provide a perfect model of the signal, but the
model is not compact; this is because the pseudo-inverse framework finds the
solution $\alpha$ with minimum two-norm, which is a poor metric for compaction
[32, 220].
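
A small numerical case shows why the minimum two-norm solution tends not to be compact. The sketch below uses a hypothetical three-vector dictionary in two dimensions and takes the signal equal to the third dictionary vector, so a one-term expansion exists; `numpy.linalg.pinv` (which is SVD-based) nevertheless returns a solution that uses all three vectors.

```python
import numpy as np

# Hypothetical dictionary: M = 3 vectors in R^2 (columns of D).
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# The signal is exactly the third dictionary vector.
x = D[:, 2].copy()

# Compact expansion: a single nonzero coefficient.
alpha_sparse = np.array([0.0, 0.0, 1.0])

# Minimum two-norm expansion from the SVD-based pseudo-inverse.
alpha_pinv = np.linalg.pinv(D) @ x

print("sparse:", alpha_sparse, " norm =", np.linalg.norm(alpha_sparse))
print("pinv:  ", alpha_pinv, " norm =", np.linalg.norm(alpha_pinv))
```

The pseudo-inverse solution is approximately [1/3, 1/3, 2/3]: it has a smaller two-norm than the one-term expansion but spreads the signal over every dictionary vector.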

Given this information about the SVD, not to mention the computational 
cost of the SVD itself, it is necessary to consider other solution methods if a 
compact representation is desired. There are two distinct approaches. The 
first class of methods involves structuring the dictionary so that it contains 
many bases; for a given signal, the best basis is chosen from the dictionary. 
The second class of methods are more general in that they apply to arbitrary 
dictionaries with no particular structure; here, the algorithms are especially 
designed to derive compact expansions. These are discussed briefly below, after 
an introduction to general overcomplete sets. All of these issues surrounding
overcomplete expansions are discussed at length in Chapter 6. 


Frames. An overcomplete set of vectors $\{d_m\}$ is a frame if there exist two
positive constants $E > 0$ and $F < \infty$, referred to as frame bounds, such that

$$E \|x\|^2 \leq \sum_{m} |\langle d_m, x \rangle|^2 \leq F \|x\|^2 \qquad (1.25)$$
for any vector $x$. If $E = F$, the set is referred to as a tight frame and a signal
can be expanded in a form reminiscent of the basis case:

$$x = \frac{1}{E} \sum_{m} \langle d_m, x \rangle d_m. \qquad (1.26)$$

If the expansion vectors $d_m$ have unit norm, $E$ is a measure of the redundancy
of the frame, namely $M/N$ for a frame consisting of $M$ vectors in an
$N$-dimensional space.

The tight frame expansion in Equation (1.26) is equivalent to the expansion 
given by the SVD pseudo-inverse; it has the minimum two-norm of all possible 
expansions and thus does not achieve compaction. A similar expansion for 
frames that are not tight can be formulated in terms of a dual frame; it is also 
connected to the SVD and does not lead to a sparse representation [41, 238]. 
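
These properties can be verified on a standard small example that is not taken from the text: three unit vectors spaced 120 degrees apart in the plane, which form a tight frame with $E = F = M/N = 3/2$. The sketch checks the frame bound, the tight frame expansion, and its agreement with the pseudo-inverse solution; the test signal is arbitrary.

```python
import numpy as np

# Three unit-norm vectors at 120-degree spacing in R^2 (columns of D).
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
D = np.vstack([np.cos(angles), np.sin(angles)])

M, N = D.shape[1], D.shape[0]
E = M / N  # frame bound of this unit-norm tight frame

x = np.array([0.7, -1.2])  # arbitrary test signal

# Frame coefficients <d_m, x> and their energy.
corr = D.T @ x
energy_ratio = np.sum(corr ** 2) / np.sum(x ** 2)

# Tight frame expansion: x = (1/E) sum_m <d_m, x> d_m.
alpha = corr / E
x_rec = D @ alpha

# The same coefficients result from the SVD-based pseudo-inverse.
alpha_pinv = np.linalg.pinv(D) @ x

print(energy_ratio, x_rec, np.allclose(alpha, alpha_pinv))
```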

More details on frames can be found in the literature [39, 41, 238]. It should 
simply be noted here that frames and oversampled filter banks are related in the 
same fashion as biorthogonal bases and critically sampled perfect reconstruction 
filter banks. Also, if a signal is to be reconstructed in a stable fashion from 
an expansion, meaning that bounded errors in the expansion coefficients lead 
to bounded errors in the reconstruction, it is necessary that the expansion set 
constitute a frame [238]. 

In the next two sections, two types of overcomplete expansions are consid- 
ered. These approaches are based on the theory of frames, but the discussions 
are phrased in terms of overcomplete dictionaries. It should be noted that such 
overcomplete dictionaries are indeed frames. 


Best basis methods. Best basis and adaptive wavelet packet methods, while 
not typically formalized in such a manner, can be interpreted as overcomplete 
expansions in which the dictionary contains a set of bases: 


For a given signal, the best basis from the dictionary is chosen for the expan- 
sion according to some metric such as the entropy of the coefficients [37], the 
mean-square error of a thresholded expansion, a denoising measure [50, 138], 
or rate-distortion considerations [102, 191]. In each of the cited approaches, 
the bases in the dictionary correspond to tree-structured filter banks; there are 
thus mathematical relationships between the various bases and the expansions 
in those bases. In these cases, choosing the best basis (or wavelet packet) is 
equivalent to choosing the best filter bank structure, possibly time-varying, for 
a given signal. More general best basis approaches, where the various bases are 
not intrinsically related, have not been widely explored. 
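
The selection step can be sketched for the simplest possible dictionary of bases. The example below is an illustrative assumption, not the tree-structured algorithms of the cited work: the dictionary holds only two orthonormal bases (the identity and a unit-norm DFT basis), and the metric is the entropy of the normalized coefficient energies. The function names `coefficient_entropy` and `best_basis` are hypothetical.

```python
import numpy as np

def coefficient_entropy(coeffs):
    """Entropy of the normalized coefficient energies (lower means more compact)."""
    p = np.abs(coeffs) ** 2
    p = p / p.sum()
    p = p[p > 1e-12]  # discard numerically zero terms before taking the log
    return float(-np.sum(p * np.log(p)))

def best_basis(x, bases):
    """Return the name of the orthonormal basis with the lowest coefficient entropy."""
    scores = {name: coefficient_entropy(B.conj().T @ x) for name, B in bases.items()}
    return min(scores, key=scores.get), scores

N = 32
n = np.arange(N)
bases = {
    "identity": np.eye(N),
    "dft": np.exp(2j * np.pi * np.outer(n, n) / N) / np.sqrt(N),
}

impulse = np.zeros(N)
impulse[7] = 1.0
sinusoid = np.cos(2 * np.pi * 4 * n / N)

choice_impulse, scores_impulse = best_basis(impulse, bases)
choice_sinusoid, scores_sinusoid = best_basis(sinusoid, bases)
print(choice_impulse, choice_sinusoid)
```

The impulse selects the identity basis and the sinusoid selects the DFT basis, so the chosen expansion adapts to the signal even though each individual basis is fixed.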


Arbitrary dictionaries. As will be seen in the discussion of time-frequency 
resolution in Section 1.5.2, best basis methods involving tree-structured filter 
banks, i.e. adaptive wavelet packets, still have certain limitations for signal 
modeling because of the underlying structure of the sets of bases. While that 




structure does provide for efficient computation, in the task of signal model- 
ing it becomes necessary to forego those computational advantages in order to 
provide for representation of arbitrary signal behavior. This suggestion leads 
to the more general approach of considering expansions in terms of arbitrary 
dictionaries and devising algorithms that find compact solutions. Such algo- 
rithms come in two forms: those that find exact solutions that maximize a 
compaction metric, either formally or heuristically [2, 32, 193], and those that 
find sparse approximate solutions that model the signal within some error tol- 
erance [42, 139, 160]. These two paradigms have the same fundamental goal, 
namely compact modeling, but the frameworks are considerably different; in 
either case, however, the expansion functions are chosen in a signal-adaptive 
fashion and the algorithms for choosing the functions are decidedly nonlinear. 

The various algorithms for deriving overcomplete expansions apply to ar- 
bitrary dictionaries. It is advantageous, however, if the dictionary elements 
can be parameterized in terms of relevant features such as time location, scale, 
and frequency. Such parametric structure is useful for signal coding since the 
dictionaries and expansion functions can be represented with simple parame- 
ter sets, and for signal analysis in that the parameters provide an immediate 
description of the signal behavior. It is especially worth noting that using a 
parametric dictionary provides a connection between overcomplete expansions 
and parametric models. 


1.4.3 Example: Haar Functions 


Basis expansions and overcomplete expansions can be easily compared using 
Haar functions; these are the earliest and simplest examples of wavelet bases 
[238]. For discrete-time signals with eight time points, the matrix corresponding 
to a Haar wavelet basis with two scales is 


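$$
B_{\mathrm{Haar}} =
\begin{bmatrix}
\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\
\frac{1}{2} & \frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} \\
\frac{1}{2} & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} & \frac{1}{2}
\end{bmatrix}^{T} \qquad (1.28)
$$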


where the basis consists of shifts by two and by four of the small scale and large 
scale Haar functions, respectively. The matrix is written in this transposed form 
to illustrate its relationship to the graphical description of the Haar basis given 
in Figure 1.3. An overcomplete Haar dictionary can be constructed by includ- 
ing all of the shifts by one of both small and large scales; the corresponding 
dictionary matrix is given in Figure 1.4. 

Figure 1.5(a) shows the signal $x_1[n] = b_2$, the second column of the Haar
basis matrix. Figure 1.5(b) shows a similar signal, $x_2[n] = x_1[n - 1]$, a circular

Figure 1.3. The Haar basis with two scales (for $\mathbb{C}^8$).

Figure 1.4. The dictionary matrix for an overcomplete Haar set.


time-shift of $x_1[n]$. As shown in Figure 1.5(c), the decomposition of $x_1[n]$ in the
Haar basis is compact since $x_1[n]$ is actually in the basis; Figure 1.5(d), however,
indicates that the Haar basis decomposition of $x_2[n]$ is not compact and
is indeed a much less sparse model than the pure time-domain signal represen- 
tation. Despite the strong relationship between the two signals, the transform 
representations are very different. The breakdown occurs in this particular

Figure 1.5. Comparison of decompositions in the Haar basis of Equation (1.28) and the
Haar dictionary of Figure 1.4. Decompositions of signals (a) and (b) appear in the
column beneath the respective signal. The basis expansion in (c) is compact, while that in
(d) provides a poor model. The overcomplete expansions in (e) and (f) are compact, but
these cannot generally be computed by linear methods such as the SVD, which for this case
yields the noncompact expansions given in (g) and (h).


example because the wavelet transform is not time-invariant; similar limitations
apply to any basis expansion. Expansions using the overcomplete Haar dictio- 
nary are shown in Figures 1.5(e) and 1.5(f). Both of these models are compact. 
Noncompact overcomplete expansions derived using the SVD pseudo-inverse of 
$D_{\mathrm{Haar}}$ are shown in Figures 1.5(g) and 1.5(h). Given the existence of the compact
representations in Figures 1.5(e) and 1.5(f), the dispersion evident in the
SVD signal models motivates the investigation of algorithms other than the 
SVD for deriving overcomplete expansions. Algorithms that derive compact 
expansions based on overcomplete dictionaries will be addressed in Chapter 6. 
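
The Haar comparison can be reproduced numerically along the following lines. This is a sketch under stated assumptions: the basis follows the description of Equation (1.28) (small-scale functions shifted by two, large-scale functions shifted by four), while the overcomplete dictionary is assumed here to contain all eight circular shifts of the small-scale wavelet, the large-scale wavelet, and the large-scale averaging function; the exact contents of the dictionary in Figure 1.4 are not recoverable from the text, so that layout is a guess.

```python
import numpy as np

s, h = 1 / np.sqrt(2), 0.5
small = np.array([s, -s, 0, 0, 0, 0, 0, 0])    # small-scale Haar wavelet
large = np.array([h, h, -h, -h, 0, 0, 0, 0])   # large-scale Haar wavelet
scaling = np.array([h, h, h, h, 0, 0, 0, 0])   # large-scale averaging function

def shifts(v, step):
    """Circular shifts of v by multiples of step, as a list of vectors."""
    return [np.roll(v, k) for k in range(0, len(v), step)]

# Orthonormal Haar basis: 4 + 2 + 2 = 8 columns.
B = np.column_stack(shifts(small, 2) + shifts(large, 4) + shifts(scaling, 4))

# Assumed overcomplete dictionary: all shifts by one, 24 columns.
D = np.column_stack(shifts(small, 1) + shifts(large, 1) + shifts(scaling, 1))

x1 = B[:, 1]          # second column of the basis
x2 = np.roll(x1, 1)   # circular time-shift by one sample

def nnz(c, tol=1e-10):
    """Number of coefficients with magnitude above tol."""
    return int(np.sum(np.abs(c) > tol))

basis_x1 = B.T @ x1                  # orthonormal basis: inner products
basis_x2 = B.T @ x2
pinv_x2 = np.linalg.pinv(D) @ x2     # minimum two-norm dictionary expansion

sparse_x2 = np.zeros(D.shape[1])
sparse_x2[3] = 1.0                   # x2 is itself the fourth dictionary atom

print(nnz(basis_x1), nnz(basis_x2), nnz(sparse_x2), nnz(pinv_x2))
```

Under these assumptions the shifted signal requires six basis coefficients but only one dictionary atom, and the pseudo-inverse solution, although exact, does not find that one-atom expansion.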


1.4.4 Geometric Interpretation of Signal Expansions 


The linear algebra discussed above can be interpreted geometrically. Figure 1.6 
shows a comparison of various expansions in a two-dimensional vector space. 
The diagrams illustrate synthesis of the same signal using the vectors in an 
Figure 1.6. Geometric interpretation of signal expansions for orthogonal and biorthogonal
bases and an overcomplete dictionary or frame. 


orthogonal basis, a biorthogonal basis, and an overcomplete dictionary; issues 
related to analysis-synthesis and modification are discussed below. 


Analysis-synthesis. In each of the decompositions in Figure 1.6, the signal 
is reconstructed exactly as the sum of two expansion vectors. For the orthogo- 
nal basis, the expansion is unique and the expansion coefficients can be derived 
independently by simply projecting the signal onto the basis vectors. For the 
biorthogonal basis, the expansion vectors are not orthogonal; the expansion is 
still unique and the coefficients can still be independently evaluated, but the 
evaluation of the coefficients is done by projection onto a dual basis as described 
in Section 1.4.1. For the overcomplete frame, an infinite number of represen- 
tations are possible since the vectors in the frame are linearly dependent. One 
way to compute such an overcomplete expansion is to project the signal onto 
a dual frame; such methods, however, are related to the SVD and do not yield 
compact models [40]. As discussed in Section 1.4.2, there are a variety of other 
methods for deriving overcomplete expansions. In this example, it is clear that 
a compact model can be achieved by using the frame vector that is most highly 
correlated with the signal since the projection of the signal onto this vector 
captures most of the signal energy. This greedy approach, known as matching 
pursuit, is explored further in Chapter 6 for higher-dimensional cases. 
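
The greedy idea can be sketched in a few lines for a two-dimensional case. The following is a minimal illustration for real-valued signals and unit-norm atoms, not the formulation developed in Chapter 6; the function name `matching_pursuit`, the three-vector frame, and the signal are all assumptions made for the example.

```python
import numpy as np

def matching_pursuit(x, D, n_iter):
    """Greedy sketch for real signals; D must have unit-norm columns (atoms)."""
    residual = np.asarray(x, dtype=float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        correlations = D.T @ residual              # inner products with all atoms
        m = int(np.argmax(np.abs(correlations)))   # most highly correlated atom
        coeffs[m] += correlations[m]
        residual = residual - correlations[m] * D[:, m]
    return coeffs, residual

# Hypothetical frame: three unit vectors at 0, 60, and 120 degrees.
angles = np.deg2rad([0.0, 60.0, 120.0])
D = np.vstack([np.cos(angles), np.sin(angles)])

x = np.array([1.0, 1.5])  # arbitrary signal, roughly aligned with the 60-degree atom

coeffs1, res1 = matching_pursuit(x, D, n_iter=1)
energy_captured = 1.0 - (res1 @ res1) / (x @ x)

coeffs5, res5 = matching_pursuit(x, D, n_iter=5)
print(coeffs1, energy_captured, np.linalg.norm(res5))
```

A single iteration selects the 60-degree atom and captures more than 99 percent of the signal energy; further iterations only refine the small residual.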


Modification. Modifications based on signal models involve either adjusting 
the expansion coefficients, the expansion functions, or both. It is desirable in 
any of these cases that the outcome of the modification be predictable. In this 
section, the case of coefficient modification is discussed since the vector inter- 
pretation provided above lends immediate insight; modifying the coefficients 
simply amounts to adjusting the lengths of the component vectors in the syn- 
thesis. In the orthogonal case, the independence of the components leads to 
a certain robustness for modifications since each projection can be modified 
independently; if the orthogonal axes correspond to perceptual features to be 
adjusted, these features can be separately adjusted. In the biorthogonal case,
to achieve the equivalent modification with respect to the orthogonal axes, the 
coupling between the projections must be taken into account. The most inter- 
esting caveat occurs in the frame case, however; because an overcomplete set is 
linearly dependent, some linear combinations of the frame vectors will add to 
zero. This means that some modifications of the expansion coefficients, namely
those that correspond to adding vectors in the null space of the dictionary ma- 
trix D, will have no effect on the reconstruction. This may seem to be at odds 
with the previous assertion that compact models are useful for modification, 
but this is not necessarily the case. If fundamental signal structures are iso- 
lated as in compact models, the corresponding coefficients and functions can 
be modified jointly to avoid such difficulties. In Chapter 2, such issues arise 
in the context of establishing constraints on the synthesis components to avoid 
distortion in the reconstruction. 


1.5 TIME-FREQUENCY DECOMPOSITIONS


The domains of time and frequency are fundamental to signal descriptions; 
relatively recently, scale has been considered as another appropriate domain for 
signal analysis [238]. These various arenas, in addition to being mathematically 
cohesive, are well-rooted in physical and perceptual foundations; in some sense, 
the human perceptual experience can be well summarized in terms of when an 
event occurred (time), the duration of a given event (scale), and the rate of 
occurrence of events (frequency). 

In this section, the notion of joint time-frequency representation of a signal 
is explored; the basic idea is that a model should indicate the local time and 
frequency behavior of a signal. Some extent of time localization is necessary for 
real-world processing of signals; it is impractical to model a signal defined over 
all time, so some time-localized or sequential approach to processing is needed. 
Time localization is also important for modeling transients in nonstationary 
signals; furthermore, various transients may have highly variable durations, 
so scale localization is also desirable in signal modeling. Finally, frequency 
localization is of interest because of the relationship of frequency to pitch in 
audio signals, and because of the importance of frequency in understanding the 
behavior of linear systems. Given these motivations, signal models of the form 


$$x[n] \approx \sum_{i} \alpha_i g_i[n] \qquad (1.29)$$


are of special interest when the expansion functions $g_i[n]$ are localized in
time-frequency, since such expansions indicate the local time-frequency characteristics
of a signal. Such cases, first elaborated by Gabor from both theoretical and
psychoacoustic standpoints [74, 75], are referred to as time-frequency atomic
decompositions; the localized functions $g_i[n]$ are time-frequency atoms, fundamental
particles which comprise natural signals.

Atomic decompositions lead naturally to graphical time-frequency represen- 
tations that are useful for signal analysis. Unfortunately, the resolution of any 




such analysis is fundamentally limited by physical principles [36, 100, 244]. This 
is the subject of Section 1.5.1, which discusses resolution tradeoffs between the 
various representation domains. With these tradeoffs in mind, methods for 
visualizing time-frequency models are considered in Sections 1.5.2 and 1.5.3. 
Lastly, Section 1.5.4 discusses applications of time-frequency atomic models in 
the field of computer music [110, 198, 226, 227].


1.5.1 Time-Frequency Atoms, Localization, and Multiresolution 


The time-frequency localization of any given atom is constrained by a resolu- 
tion limitation analogous to the Heisenberg uncertainty principle of quantum 
physics [74, 244]. In short, good frequency localization can only be achieved by 
analyzing over a long period of time, so it comes at the expense of poor time 
resolution; similarly, fine time resolution does not allow for accurate frequency 
resolution. Note that analysis over a long period of time involves considering 
large scale signal behavior, and that analysis over short periods of time involves 
examining small scale signal behavior; furthermore, it is sensible to analyze for 
low frequency components over large scales since such components by definition 
do not change rapidly in time, and likewise high frequency components should 
be analyzed over short scales. The point here is simply that scale is necessar- 
ily intertwined in any notion of time-frequency localization. These tradeoffs 
between localization in time, frequency, and scale are the motivation of the 
wavelet transform and multiresolution signal decompositions [137, 238]. 

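The localization of an atom can be depicted by a tile on the time-frequency
plane; a tile is simply a rectangular section centered at some $(t_0, \omega_0)$ and having
dimensions $\Delta_t$ and $\Delta_\omega$ that describe where most of the atom's energy lies [238]:

$$\Delta_t^2 = \int_{-\infty}^{\infty} (t - t_0)^2 \, |x(t - t_0)|^2 \, dt \qquad (1.30)$$

$$\Delta_\omega^2 = \int_{-\infty}^{\infty} (\omega - \omega_0)^2 \, |X(\omega - \omega_0)|^2 \, d\omega. \qquad (1.31)$$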


The uncertainty principle gives a lower bound on the product of these widths: 


$$\Delta_t \Delta_\omega \geq \sqrt{\frac{\pi}{2}}. \qquad (1.32)$$

This uncertainty bound implies that there is a lower limit on the area of a 
time-frequency tile. It should be noted that non-rectangular tiles can also be 
formulated [13, 57, 58, 60, 143]. 
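
The widths defined above can be evaluated numerically for a specific atom. The sketch below makes two assumptions that the text does not state explicitly: the Fourier transform convention $X(\omega) = \int x(t) e^{-j\omega t} dt$, and a unit-energy atom. Under those assumptions a Gaussian centered at $t_0 = 0$, $\omega_0 = 0$ gives a width product of $\sqrt{\pi/2} \approx 1.2533$. The integrals are approximated by Riemann sums on a finite grid.

```python
import numpy as np

# Time and frequency grids; the Gaussian is negligible beyond |t| = 8 or |w| = 8.
t = np.linspace(-8.0, 8.0, 801)
w = np.linspace(-8.0, 8.0, 801)
dt = t[1] - t[0]
dw = w[1] - w[0]

# Unit-energy Gaussian atom centered at t0 = 0.
x = np.pi ** -0.25 * np.exp(-t ** 2 / 2)

# Riemann-sum approximation of X(w) = integral of x(t) exp(-j w t) dt.
X = (np.exp(-1j * np.outer(w, t)) @ x) * dt

# Widths about t0 = 0 and w0 = 0.
delta_t = np.sqrt(np.sum(t ** 2 * np.abs(x) ** 2) * dt)
delta_w = np.sqrt(np.sum(w ** 2 * np.abs(X) ** 2) * dw)
product = delta_t * delta_w

print(delta_t, delta_w, product, np.sqrt(np.pi / 2))
```

With a different Fourier normalization the numerical value of the product changes by a constant factor, but the tradeoff between the two widths is the same.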

Within the limit of the resolution bound, many tile shapes are possible. 
These correspond to atoms ranging from impulses, which are narrow in time 
and broad in frequency, to sinusoids, which are broad in time and narrow in 
frequency; intermediate tile shapes basically correspond to modulated windows, 
i.e. time-windowed sinusoids. Various tiles are depicted in Figure 1.7. 

It should be noted that tiles with area close to the uncertainty bound are of 
primary interest; larger tiles do not provide the desired localized information 
Figure 1.7. Tiles depicting the time-frequency localization of various expansion functions.


about the signal. With this in mind, one approach to generating a set of 
expansion functions for signal modeling is to start with a mother tile of small 
area and to derive a corresponding family of tiles, each having the same area, 
by scaling the time and frequency widths by inverse factors and allowing for 
shifts in time. Mathematically, this is given by 


$$g_{a,b}(t) = \frac{1}{\sqrt{a}} \, g\!\left(\frac{t - b}{a}\right), \qquad (1.33)$$

where g(t) is the mother function. The continuous-time wavelet transform is 
based on families of this nature; restricting the scales and time shifts to powers 
of two results in the standard discrete-time wavelet transform. Expansion using 
such a function set with variable scale leads to a multiresolution signal model, 
which is physically sensible given the time-frequency tradeoffs discussed earlier. 
Given a signal expansion in terms of a set of tiles, the signal can be modified 
by altering the underlying tiles. Time-shift, modulation, and scaling modifica- 
tions of tiles are depicted in Figure 1.8. One caveat to note is that synthesis 
difficulties may arise if the tiles are modified in such a way that the synthesis
algorithm is not capable of constructing the new tiles, i.e. if the new tiles
are not in the signal modeling dictionary. This occurs in basis expansions; for 
instance, in the case of critically sampled filter banks, arbitrary modifications 
of the subband signals yield undesirable aliasing artifacts. The enhancement 
of modification capabilities is thus another motivation for using overcomplete 
expansions instead of basis expansions. 

In this framework of tiles, the interpretation is that each expansion function 
in a decomposition analyzes the signal behavior in the time-frequency region 
indicated by its tile. Given that a signal may have energy anywhere in the 
time-frequency plane, the objective of adaptive modeling is to decide where to 
place tiles to capture the signal energy. Tile-based interpretations of various 
time-frequency signal models are discussed in the next section.

Figure 1.8. Modification of time-frequency tiles: translation, modulation, and scaling.


1.5.2 Tilings of the Time-Frequency Plane 


Signal expansions can be interpreted in terms of time-frequency tiles. For 
instance, a basis expansion for an N-dimensional signal can be visualized as a 
set of N tiles that cover the time-frequency plane without any gaps or overlap. 
Examples of such time-frequency tilings are given in Figure 1.9; in visualizing 
an actual expansion, each tile is shaded to depict where the signal energy lies, 
i.e. to indicate the amplitude of the corresponding expansion function. 

As indicated in Figure 1.9, the tilings for Fourier and wavelet transforms 
have regular structures; this leads to a certain simplicity in the computation 
of the corresponding expansion. As discussed in Section 1.4.1, however, these 
basis expansions have limitations for representing arbitrary signals. For that 
reason, it is of interest to consider tilings with less restrictive structures. This 
is the idea in best basis and adaptive wavelet packet methods, where the best 
tiling for a particular signal is chosen; the best basis from a dictionary of bases 
is picked, according to some metric [37, 50, 102, 138, 191].

The time-varying tiling depicted in Figure 1.9 is intended as an example 
of an adaptive wavelet packet implemented with a signal-adaptive filter bank. 
This approach is suitable for a wide class of signals and allows for efficient com- 
putation, but the tiling is still restricted by the dyadic relationships between 
the scales, modulations, and time-shifts. The lack of complete generality arises 
because the tile sets under consideration cover the plane exactly; this captures
all of the signal energy, but not necessarily in a compact way. In the over- 
complete case, overlapping tiles are admitted into the signal decomposition; 
compact models can then be achieved by choosing a few such tiles that cover 
the regions in the time-frequency plane where the signal has significant energy.

Figure 1.9. Tilings of the time-frequency plane for a Fourier transform, short-time Fourier
transform, wavelet transform, and wavelet packet.


1.5.3 Quadratic Time-Frequency Representations 


Quadratic time-frequency representations or bilinear expansions have received 
considerable attention in the literature [35]. Fundamentally, such approaches 
are based on the Wigner-Ville distribution (WVD): 


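$$ \mathrm{WVD}\{x\}(\omega,\tau) = \int_{-\infty}^{\infty} x\!\left(\tau + \frac{t}{2}\right) x^{*}\!\left(\tau - \frac{t}{2}\right) e^{-j\omega t}\, dt. \tag{1.34} $$
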
Such representations provide improved resolution over linear expansions, but 
at the expense of the appearance of cross terms for signals with multiple com- 
ponents. For example, for a signal that consists of one linear chirp (sinusoid 
with linearly increasing frequency), the chirp is clearly identifiable in the distri- 
bution; for a signal consisting of two crossing chirps, the product in the integral 
yields cross terms that degrade the readability of the time-frequency distribu- 
tion [11, 111]. These terms can be smoothed out in various ways, but always 
with the countereffect of decreasing the resolution of the model [118, 169, 238]. 
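
The origin of the cross terms can be made explicit by substituting a two-component signal into the Wigner-Ville integral of Equation (1.34). The following is a sketch of that expansion; the two-argument cross-distribution notation WVD{x_1, x_2} is introduced here only for illustration and is not taken from the text.

```latex
% Two-component signal
x(t) = x_1(t) + x_2(t)

% Cross-distribution (illustrative notation)
\mathrm{WVD}\{x_1, x_2\}(\omega,\tau)
  = \int_{-\infty}^{\infty} x_1\!\left(\tau + \tfrac{t}{2}\right)
    x_2^{*}\!\left(\tau - \tfrac{t}{2}\right) e^{-j\omega t}\, dt

% Expanding the product in the integrand gives four terms; the two mixed
% terms are complex conjugates of one another, so
\mathrm{WVD}\{x\}(\omega,\tau)
  = \mathrm{WVD}\{x_1\}(\omega,\tau) + \mathrm{WVD}\{x_2\}(\omega,\tau)
  + 2\,\mathrm{Re}\!\left\{ \mathrm{WVD}\{x_1, x_2\}(\omega,\tau) \right\}
```

The last term is the cross term: it is present whenever both components are nonzero, regardless of whether they overlap in time or frequency, which is why it clutters the distribution of multicomponent signals.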

Cross terms detract from the usefulness of a quadratic time-frequency rep- 
resentation. In some sense, the cross terms result in a noncompact model; they 
are extraneous elements in the representation that impede signal analysis. Even 
in cases where the cross terms are smoothed out, the loss of resolution corresponds
to a loss of compaction, so this problem with quadratic time-frequency
representations is quite general. One approach is to improve the resolution of
a smoothed representation by a nonlinear post-processing method referred to 
as reallocation or reassignment, in which the focus of the distribution is succes- 
sively refined [11, 170, 171]. Another approach is to derive an atomic decom- 
position of the signal, perhaps approximate, and then define a time-frequency 
representation (TFR) of the signal as a weighted sum of the time-frequency 
representations of the atoms [139]: 


$$ x[n] \approx \sum_i \alpha_i\, g_i[n] \tag{1.35} $$

$$ \mathrm{TFR}\{x\}(\omega,\tau) = \sum_i |\alpha_i|^2\, \mathrm{WVD}\{g_i\}(\omega,\tau). \tag{1.36} $$


There are no cross terms in distributions derived in this manner [139]; thus, 
another motivation for atomic time-frequency models is that they lead to clear 
visual descriptions of signal behavior. Of course, if the atomic decomposition 
is erroneous, the visual description will not be particularly useful. 


1.5.4 Granular Synthesis 


Granular synthesis is a technique in computer music which involves accumulat- 
ing a large number of basic sonic components or grains to create a substantial 
acoustic event [198]. This approach is based on a theory of sound and per- 
ception that was first proposed by Gabor [75]; he suggested that any sound 
could be described using a quantum representation where each acoustic quan- 
tum or grain corresponds to a local time-frequency component of the sound. 
Such descriptions are psychoacoustically appropriate given the time-frequency 
resolution tradeoffs and limitations observed in the auditory system. 

In early efforts in granular music synthesis, artificial sounds were composed 
by combining thousands of parameterized grains [198]. Individual grains were 
generated according to synthetic parameters describing both time-domain and 
frequency-domain characteristics, e.g. time location, duration, envelope shape, 
and modulation. This method was restricted to the synthesis of artificial 
sounds, however, because the method did not have an accompanying analysis 
capable of deriving granular decompositions of existing natural sounds [226]. 

Simple analysis techniques for deriving grains from real sounds have been 
proposed in the literature [110, 227]. The objective of such granulation ap- 
proaches is to derive a representation of natural sounds that enables modifi- 
cations such as time-scaling or pitch-shifting prior to resynthesis. The basic 
idea in these methods is to extract grains by applying time-domain windows 
to the signal. Each windowed portion of the signal is treated as a grain, and 
parameterized by its window function and time location. These grains can be 
repositioned in time or resampled in various ways to achieve desirable modifications
[110, 227]. Similar ideas have been explored in the speech processing
community; one example is the pitch-synchronous overlap-add (PSOLA)
method for signal modification [116, 158].

Grains derived by the time-windowing process can be interpreted as signal-
dependent expansion functions. If the grains are chosen judiciously, e.g. to 
correspond to pitch periods of a pseudo-periodic sound, then the model cap- 
tures important signal structures. Because of the complicated time structure of 
natural sounds, however, grains derived in this manner are generally difficult to 
represent efficiently and are thus not particularly applicable to signal coding. 
Nevertheless, this method is of interest because of its modification capabilities 
and its underlying signal adaptivity. 
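
The granulation idea described above (window the signal into grains, then reposition the grains before resynthesis) can be sketched as follows. This is a deliberately crude illustration under stated assumptions: the function names, the periodic Hann window, and the fixed hop are hypothetical choices, and the grains are not pitch-synchronous as they would be in a PSOLA-style method.

```python
import numpy as np

def extract_grains(x, window, hop):
    """Cut x into overlapping windowed grains; returns (start_index, grain) pairs."""
    n = len(window)
    return [(start, window * x[start:start + n])
            for start in range(0, len(x) - n + 1, hop)]

def overlap_add(grains, length):
    """Sum grains into an output buffer at their (possibly modified) start positions."""
    y = np.zeros(length)
    for start, grain in grains:
        y[start:start + len(grain)] += grain
    return y

def time_scale(x, window, hop, stretch):
    """Crude time-scaling: reposition each grain at stretch times its original start."""
    grains = extract_grains(x, window, hop)
    moved = [(int(round(stretch * start)), g) for start, g in grains]
    length = moved[-1][0] + len(window)
    return overlap_add(moved, length)

# Example signal and a periodic Hann window with 50% overlap.
x = np.sin(2 * np.pi * 0.01 * np.arange(1024))
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(256) / 256)
unchanged = time_scale(x, window, hop=128, stretch=1.0)
stretched = time_scale(x, window, hop=128, stretch=2.0)
```

With `stretch=1.0` the grains return to their original positions and the interior of the signal is recovered, since the shifted windows sum to one. With other stretch factors the repositioned windows generally no longer sum to a constant and adjacent grains are no longer phase-coherent, which is one reason practical methods choose grain boundaries in a signal-dependent way.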

Time-windowed signal components, while useful for modifications, are dis- 
parate from the fundamental acoustic quanta suggested by Gabor; simple win- 
dowing is not an appropriate analysis for Gabor’s time-frequency representa- 
tion. With that as motivation, the three distinct signal models in this book are 
interpreted as granulation approaches: the sinusoidal model, pitch-synchronous 
expansions, and atomic models based on overcomplete time-frequency dictio- 
naries can all be viewed in this light. These models provide time-frequency 
grains for additive synthesis of natural signals, and these grains can generally 
be thought of as tiles on the time-frequency plane. 


1.6 OVERVIEW 


This book is concerned with signal models of the form given in Equation (1.1), 
namely additive expansions. The models in Chapters 2 through 5 can be clas- 
sified as parametric approaches. Chapter 6 discusses a method that would be 
traditionally classified as nonparametric but which actually demonstrates that 
the distinction between the two types of models is artificial. 


1.6.1 Outline 


The contents of this book are as follows. First, Chapter 2 discusses the si- 
nusoidal model, in which the expansion functions are time-evolving sinusoids. 
This approach is presented as an evolution of the nonparametric short-time 
Fourier transform into the parametric sinusoidal model; the chapter includes
detailed treatments of the STFT and analysis-synthesis methods for the
sinusoidal model. Chapter 3 provides an interpretation of the sinusoidal model 
in terms of time-frequency atoms, which motivates the consideration of mul- 
tiresolution extensions of the model for accurately representing localized signal 
behavior. Chapter 4 discusses the sinusoidal analysis-synthesis residual and 
presents a perceptually motivated model for this residual. Chapter 5 exam- 
ines pitch-synchronous sinusoidal models and wavelet transforms; estimation 
of the pitch parameter is shown to provide a useful avenue for improving the 
signal representation in both cases. In Chapter 6, overcomplete expansions 
are revisited; signal modeling is interpreted as an inverse problem and connec- 
tions between structured overcomplete expansions and parametric methods are 
considered. The chapter discusses the matching pursuit algorithm for comput- 
ing overcomplete expansions, and considers overcomplete dictionaries based on 
damped sinusoids, for which expansions can be computed using simple recursive
filter banks. Finally, Chapter 7 summarizes the book and presents various
concluding remarks about adaptive signal models and related algorithms.


1.6.2 Themes


This book has several underlying and recurring themes. In a sense, this text is 
simply about the relationships between these themes, some of which have been 
discussed in preliminary form in [88, 89]. 


Filter banks and multiresolution. Filter bank theory and design appear in 
several places in this book. Primarily, the book deals with the interpretation of 
filter banks as analysis-synthesis structures for signal modeling. The connection 
between multirate filter banks and multiresolution signal modeling is explored. 


Signal-adaptive representations. Each of the signal models or represen- 
tations discussed in this book exhibits signal adaptivity. In the sinusoidal and 
pitch-synchronous models, the decompositions are signal-adaptive in that the 
expansion functions are generated based on data extracted from the signal. 
In the overcomplete expansions, the models are adaptive in that the expan- 
sion functions for the signal decomposition are chosen from the dictionary in a 
signal-dependent fashion. 


Parametric models. The expansion functions in the sinusoidal and pitch- 
synchronous models are generated based on parameters derived by the signal 
analysis. Such parametric expansions, as discussed in Section 1.2, are useful for 
characterization, compression, and modification of signals. Overcomplete ex- 
pansions can be similarly parametric in nature if the underlying dictionary has 
a meaningful parametric structure. In such cases, the traditional distinction 
between parametric and nonparametric methods evaporates, and the overcom- 
plete expansion provides a highly useful signal model. 


Nonlinear analysis. In each model, the model estimation is inherently non- 
linear. The sinusoidal and pitch-synchronous models rely on nonlinear parame- 
ter estimation and interpolation. The matching pursuit is inherently nonlinear 
in the way it selects the expansion functions from the overcomplete dictionary; 
it overcomes the inadequacies of linear methods such as the SVD while provid- 
ing for successive refinement and compact sparse approximations. It has been 
argued that overcompleteness, when coupled with a nonlinear analysis, yields 
a signal-adaptive representation, so these notions are tightly coupled [42, 94]. 


Atomic models. Finally, all of the models in this book can be interpreted 
in terms of localized time-frequency atoms or grains. The notion of time- 
frequency decompositions has been discussed at length in this introduction, 
and will continue to play a major role throughout the remainder of this book.


...in nature there are few sharp lines...


— A.R.Ammons, “Corsons Inlet” 


The sinusoidal model has been widely applied to speech coding and processing
[6, 113, 125, 135, 144, 147, 149, 185, 188] and audio analysis-modification-synthesis
[62, 76, 182, 183, 201, 207, 208]. This chapter discusses the sinusoidal
model, including analysis and synthesis techniques, reconstruction artifacts, 
and modification capabilities enabled by the parametric nature of the model. 
Time-domain and frequency-domain synthesis methods are examined. A thor- 
ough review of the short-time Fourier transform is included as an introduction 
to the discussion of the sinusoidal model. 


2.1 THE SINUSOIDAL SIGNAL MODEL 


A variety of sinusoidal modeling techniques have been explored in the literature
[6, 7, 76, 125, 144, 147, 149, 201, 208]. These methods share fundamental
common points, but also have substantial, if sometimes subtle, differences.
For the sake of simplicity, this treatment adheres primarily to the approaches
presented in the early literature on sinusoidal modeling [149, 208], and not to
the many variations that have since been proposed [62, 125, 144]; comments
on some other techniques such as [46, 76] are included, but these inclusions 
are limited to techniques that are directly concerned with the modeling issues 
at hand. It should be noted that the issues to be discussed herein apply to 
sinusoidal modeling in general; their relevance is not limited by the adherence 
to the particular methods of [149, 208]. Also, note that the method of [201] is 
discussed at length in the section on frequency-domain synthesis, where various 
refinements are proposed.


2.1.1 The Sum-of-Partials Model


In sinusoidal modeling, a discrete-time signal x[n] is modeled as a sum of evolving
sinusoids called partials:

$$ x[n] \approx \hat{x}[n] = \sum_{q=1}^{Q[n]} p_q[n] = \sum_{q=1}^{Q[n]} A_q[n] \cos \Theta_q[n], \tag{2.1} $$

where Q[n] is the number of partials at time n. The q-th partial p_q[n] has
time-varying amplitude A_q[n] and total phase Θ_q[n], which describes both its
frequency evolution and phase offset. The additive components in the model
are thus simply parameterized by amplitude and frequency functions or tracks.
These tracks are assumed to vary on a time scale substantially longer than the 
sampling period, meaning that the parameter tracks can be reliably estimated 
at a subsampled rate. This assumption of slow variation leads to compaction 
in the representation. 
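
The sum-of-partials model of Equation (2.1) can be sketched as a small synthesis routine. The version below is an illustration under several assumptions that are not from the text: the number of partials is held fixed (a partial that is absent simply has zero amplitude), the total phase is formed as a running sum of the per-sample frequency, and frame-rate tracks are brought up to the sample rate by linear interpolation, chosen here purely for simplicity. The function names are hypothetical.

```python
import numpy as np

def upsample_track(frame_values, hop, num_samples):
    """Linearly interpolate a track given once per frame (every hop samples)."""
    frame_times = hop * np.arange(len(frame_values))
    return np.interp(np.arange(num_samples), frame_times, frame_values)

def synthesize_partials(amps, freqs, phases0):
    """Sum of partials: x_hat[n] = sum_q A_q[n] * cos(Theta_q[n]).

    amps, freqs: arrays of shape (Q, num_samples) holding per-sample amplitude
    and frequency (radians/sample) tracks; phases0: initial phases, shape (Q,).
    """
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    phases0 = np.asarray(phases0, dtype=float)
    # Theta_q[n] = phi_q + sum_{m=1}^{n} omega_q[m], so that Theta_q[0] = phi_q.
    theta = np.cumsum(freqs, axis=1) - freqs[:, :1] + phases0[:, None]
    return np.sum(amps * np.cos(theta), axis=0)

# Two partials whose amplitude tracks are specified every 64 samples.
num_samples = 257
amp_frames = [[1.0, 0.8, 0.5, 0.2, 0.0], [0.0, 0.3, 0.6, 0.3, 0.0]]
amps = np.stack([upsample_track(a, 64, num_samples) for a in amp_frames])
freqs = np.stack([np.full(num_samples, 0.10), np.full(num_samples, 0.25)])
x_hat = synthesize_partials(amps, freqs, phases0=[0.0, np.pi / 4])
```

Because the tracks vary slowly, only five values per partial are stored for 257 output samples in this example, which illustrates the compaction that the slow-variation assumption provides.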

The model in Equation (2.1) is reminiscent of the familiar Fourier series; 
the notion in Fourier series methods is that a periodic signal can be exactly 
represented by a sum of fixed harmonically related sinusoids. Purely periodic 
signals, however, are a mathematical abstraction. Real-world oscillatory signals
such as a musical note tend to be pseudo-periodic; they exhibit variations from 
period to period. The sinusoidal model is thus useful for modeling natural 
signals since it generalizes the Fourier series in the sense that the constituent 
sinusoids are allowed to evolve in time according to the signal behavior. Of 
course, the sinusoidal model is not limited to applications involving pseudo- 
periodic signals; models tailored specifically for pseudo-periodic signals will be 
discussed in Chapter 5. 

Fundamentally, the sinusoidal model is useful because the parameters cap- 
ture musically salient time-frequency characteristics such as spectral shape, 
harmonic structure, and loudness. Since it describes the primary musical infor- 
mation about the signal in a simple, compact form, the parameterization pro- 
vides both a reasonable coding representation and a framework for carrying out 
desirable modifications such as pitch-shifting, time-scaling, and a wide variety 
of spectral transformations such as cross-synthesis [62, 185, 188, 201, 208, 242].


2.1.2 Deterministic-plus-Stochastic Decomposition 


The approximation symbol in Equation (2.1) is included to imply that the sum- 
of-partials model does not provide an exact reconstruction of the signal. Since 
a sum of slowly-varying sinusoids is ineffective for modeling either impulsive 
events or highly uncorrelated noise, the sinusoidal model is not well-suited 
for representing broadband processes. As a result, the sinusoidal analysis- 
synthesis residual consists of such processes, which correspond to musically 
important signal features such as the colored breath noise in a flute sound or 
the impulsive mallet strikes of a marimba. Since these features are important 
for high-fidelity synthesis, an additional component is often included in the
signal model to account for broadband processes:

$$ x[n] = \hat{x}[n] + r[n] = d[n] + s[n]. \tag{2.2} $$


The resultant deterministic-plus-stochastic decomposition was introduced in
[207, 208] and has been discussed in several later efforts [83, 98]. Using this
terminology brings up salient issues about the theoretical distinction between 
deterministic and stochastic processes; to avoid such pitfalls, the following anal- 
ogy is drawn: the deterministic part of the decomposition is likened to the sum- 
of-partials of Equation (2.1) and the stochastic part is similarly likened to the 
residual of the sinusoidal analysis-synthesis process, leading to a reconstruction- 
plus-residual decomposition. This method can then be considered in light of 
the conceptual framework of Chapter 1. The sinusoidal analysis-synthesis is 
described in Sections 2.3 to 2.6; from that, the characteristics of the residual 
are inferred, which leads to the residual modeling approach of Chapter 4. 


2.2 THE SHORT-TIME FOURIER TRANSFORM


Sinusoidal modeling can be viewed in a historical context as an evolution of 
short-time Fourier transform (STFT) and phase vocoder techniques. These 
methods and variations were developed and explored in a number of references 
[4, 5, 38, 49, 63, 96, 156, 172, 173]. In this treatment, the shortcomings of the 
STFT and phase vocoder serve to motivate the general sinusoidal model. 


2.2.1 Formulation of the STFT 


In this section, the STFT is defined and interpreted; it is shown that slightly 
revising the traditional definition leads to an alternative filter bank interpre- 
tation of the STFT that is appropriate for signal modeling. Perfect recon- 
struction constraints for such STFT filter banks are derived. In the literature, 
z-transform and matrix representations have been shown to be useful in an- 
alyzing the properties of such filter banks [232, 237, 238]. Here, for the sake 
of brevity, these methods are not explored; the STFT filter banks are treated 
using time-domain considerations. 


Definition of the short-time Fourier transform. The short-time Fourier 
transform was described conceptually in Sections 1.4.1 and 1.5.1; basically, the 
goal of the STFT is to derive a time-localized representation of the frequency- 
domain behavior of a signal. The STFT is carried out by applying a sliding 
time window to the signal; this process isolates time-localized regions of the 
signal, which are each then analyzed using a discrete Fourier transform (DFT). 
Mathematically, this is given by 


$$ X[k,n] = \sum_{m=0}^{N-1} w[m]\, x[n+m]\, e^{-j\omega_k m}, \tag{2.3} $$

where the DFT is of size K, meaning that ω_k = 2πk/K, and w[m] is a time-domain
window with zero value outside the interval [0, N - 1]; windows with
infinite time support have been discussed in the literature, but these will not be
considered here [172, 224]. In the early literature on time-frequency transforms,
signal analysis-synthesis based on Gaussian windows was proposed by Gabor
[74, 75]; given this historical foundation, the STFT is sometimes referred to as
a Gabor transform [238].

The transform in Equation (2.3) can be expressed in a subsampled form 
which will be useful later: 


$$ X(k,i) = \sum_{m=0}^{N-1} w[m]\, x[m+iL]\, e^{-j\omega_k m}, \tag{2.4} $$

where L is the analysis stride, the time distance between successive applications
of the window to the data. The notation is as follows: brackets around the
arguments are used to indicate a nonsubsampled STFT such as in X[k,n],
while parentheses are used to indicate subsampling as in X(k,i), which is used
in lieu of X[k,iL] for the sake of neatness. Admittedly, the notation X(k,i) is
somewhat loose in that it does not incorporate the hop size L, but to account
for this difficulty the hop size of any subsampled STFTs under consideration
will be indicated explicitly in the text. The subsampled form of the STFT is of
interest since it allows for a reduction in the computational cost of the signal 
analysis and in the amount of data in the representation; it also affects the 
properties of the model and the reconstruction as will be demonstrated. 
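
A minimal sketch of the subsampled, phase-localized STFT of Equation (2.4) is given below. It assumes K >= N so that each frame can be computed as a zero-padded FFT of the windowed segment, and it discards any trailing samples that do not fill a complete frame; the function name `stft` and these boundary choices are assumptions made for illustration.

```python
import numpy as np

def stft(x, window, hop, K):
    """Subsampled STFT: X(k, i) = sum_m w[m] x[m + i*hop] exp(-j*2*pi*k*m/K).

    Returns an array of shape (K, num_frames). Assumes K >= len(window), so that
    each column is a zero-padded FFT of one windowed segment.
    """
    N = len(window)
    if K < N:
        raise ValueError("this sketch assumes K >= N")
    num_frames = 1 + (len(x) - N) // hop
    # Each column holds one windowed segment starting at sample i*hop.
    frames = np.stack([window * x[i * hop:i * hop + N] for i in range(num_frames)],
                      axis=1)
    return np.fft.fft(frames, n=K, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
w = np.hanning(16)
X = stft(x, w, hop=4, K=32)   # 32 frequency bins, 13 frames
```

Because the sum in Equation (2.4) runs over the local index m, the FFT of each segment gives the coefficients directly, with phase referenced to the start of that segment. A larger hop reduces the number of columns, and hence both the computation and the amount of data, in proportion.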

The definitions of the STFT given in Equations (2.3) and (2.4) differ from
those in traditional references [5, 38, 172, 173], where the transform is given as

$$
\begin{aligned}
\tilde{X}[k,n] &= \sum_{m=-\infty}^{\infty} \tilde{w}[n-m]\, x[m]\, e^{-j\omega_k m} && (2.5) \\
&= \sum_{m=n}^{n+N-1} \tilde{w}[n-m]\, x[m]\, e^{-j\omega_k m}, && (2.6)
\end{aligned}
$$

or in subsampled form as

$$ \tilde{X}(k,i) = \sum_{m=iL}^{iL+N-1} \tilde{w}[iL-m]\, x[m]\, e^{-j\omega_k m}, \tag{2.7} $$

where w̃[m] is again a time-localized window. The range of m in the sum,
and hence the support of the window w̃[n], is defined here in such a way that
the transforms X[k,n] and X̃[k,n] refer to the same N-point segment of the
signal and can thus be compared; it should be noted that in some treatments
the STFT is expressed as in Equation (2.5) but without time-reversal of the
window [232]. It will be shown that this reversal of the time index affects the
interpretation of the transform as a filter bank; more importantly, however, the
interpretation is affected by the time reference of the expansion functions. This
latter issue is discussed below.


The time reference of the STFT. In the formulation of the STFT in
Equations (2.5) and (2.7), the expansion functions are sinusoids whose time
reference is in some sense absolute; for different windowed signal segments, the
expansion functions have the same time reference, m = 0, the time origin of the
signal x[m]. On the other hand, in Equations (2.3) and (2.4) the time origin of
the expansion functions is instead the starting point of the signal segment in 
question; the phases of the expansion coefficients for a segment refer to the time 
start of that particular segment. Note that the STFT can also be formulated 
such that the phase is referenced to the center of the time window, which 
is desirable in some cases [129]; this extension will play a role in sinusoidal 
modeling, but such phase-centering will not be used in the development of the 
STFT because of the slight complications it introduces. 

The two formulations of the STFT have different ramifications for signal 
modeling; this difference can be seen by relating the two definitions [38, 190]; 


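$$
\begin{aligned}
\tilde{X}[k,n] &= \sum_{m=n}^{n+N-1} \tilde{w}[n-m]\, x[m]\, e^{-j\omega_k m} \\
&= \sum_{m=0}^{N-1} \tilde{w}[-m]\, x[n+m]\, e^{-j\omega_k (m+n)} && \text{(change of index)} \\
&= e^{-j\omega_k n} \sum_{m=0}^{N-1} \tilde{w}[-m]\, x[n+m]\, e^{-j\omega_k m} \\
&= e^{-j\omega_k n} \sum_{m=0}^{N-1} w[m]\, x[n+m]\, e^{-j\omega_k m} && (w[m] = \tilde{w}[-m]) \\
\tilde{X}[k,n] &= e^{-j\omega_k n}\, X[k,n].
\end{aligned}
\tag{2.8}
$$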


This formulation leads to two simple relationships: 


$$ X[k,n] = \tilde{X}[k,n]\, e^{j\omega_k n} \tag{2.9} $$

$$ |X[k,n]| = |\tilde{X}[k,n]|. \tag{2.10} $$


The first expression affects the interpretation of the STFT as a filter bank; the 
time signal X[k,n] is a modulated version of the baseband envelope X̃[k,n], so
the corresponding filter banks for the two cases have different structures. The 
second expression affects the interpretation of the STFT as a series of time- 
localized spectra; the short-time magnitude spectra are the same in the two 
formulations. For magnitude considerations, then, the two cases are equivalent. 
With respect to phase, however, the approaches differ in that X[k,n] provides 
a local phase. An estimate of the local phase of each partial is important for 
building the signal model, so X[k,n] is more useful for sinusoidal modeling than
X̃[k,n]. This will become more apparent in Sections 2.3 and 2.4.
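
The relationship between the two formulations can be checked numerically. The sketch below evaluates the phase-localized transform of Equation (2.3) and the traditional transform of Equation (2.6) directly from their sums and compares them; the function names and the test signal are assumptions made for illustration.

```python
import numpy as np

def stft_local(x, w, k, n, K):
    """Phase-localized STFT, Equation (2.3): phase referenced to the segment start n."""
    m = np.arange(len(w))
    return np.sum(w * x[n + m] * np.exp(-2j * np.pi * k * m / K))

def stft_absolute(x, w, k, n, K):
    """Traditional STFT, Equation (2.6): phase referenced to the signal origin.

    The time-reversed window satisfies w_tilde[n - m] = w[m - n], which is
    nonzero for m in [n, n + N - 1].
    """
    m = np.arange(n, n + len(w))
    return np.sum(w[m - n] * x[m] * np.exp(-2j * np.pi * k * m / K))

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
w = np.hamming(8)
K = 16

# For each bin k and time n, compare the two transforms.
pairs = [(stft_local(x, w, k, n, K), stft_absolute(x, w, k, n, K), k, n)
         for k in range(K) for n in (0, 7, 30)]
modulation_holds = all(np.isclose(a, b * np.exp(2j * np.pi * k * n / K))
                       for a, b, k, n in pairs)
magnitudes_match = all(np.isclose(abs(a), abs(b)) for a, b, _, _ in pairs)
```

Both flags evaluate to true: the two transforms differ only by the modulation factor exp(j ω_k n), so their magnitudes coincide while their phases are referenced to different time origins.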


Interpretations of the STFT. In [5, 190] and other traditional treatments
of the STFT, two interpretations are considered. First, the STFT can be
viewed as a series of time-localized spectra; notationally, this corresponds to
interpreting X[k,n] as a function of frequency k for a fixed n. Given that
the derivation of a time-localized spectral representation was indeed the initial
motivation of the STFT, the novelty lies in the second interpretation, where
the STFT is viewed as a bank of bandpass filters. Here, X[k,n] is thought of
as a function of time n for a fixed frequency k; it is simply the output of the
k-th filter in the STFT filter bank. A depiction of these interpretations based
on the time-frequency tiling of the STFT is given in Figure 2.1; indeed, the
notion of a tiling unifies the two perspectives.

Figure 2.1. Interpretations of the short-time Fourier transform as a series of time-localized
spectra (vertical) and as a bank of bandpass filters (horizontal).

The two interpretations are discussed in the following sections; as will be 
seen, each interpretation provides a framework for signal reconstruction and 
each framework yields a perfect reconstruction constraint. In the traditional 
formulation of the STFT, the reconstruction constraints are different for the 
two interpretations, but can be related by duality [5]. In the phase-localized for- 
mulation of Equations (2.3) and (2.4), the two frameworks immediately yield 
the same perfect reconstruction condition; this is not particularly surprising 
since the representation of the STFT as a time-frequency tiling suggests that a 
distinction between the two interpretations is indeed artificial. The mathemat- 
ical details related to these issues are developed below; also, the differences in 
the signal models corresponding to the two STFT formulations are discussed. 


The STFT as a series of time-localized spectra. If the STFT is inter- 
preted as a series of time-localized spectra, the accompanying reconstruction 
framework involves taking an inverse DFT (IDFT) of each local spectrum, and 
then connecting the resulting signal frames to synthesize the signal. If K ≥ N,
the IDFT simply returns the windowed signal segment: 


$$
\begin{aligned}
\mathrm{IDFT}\{X(k,i)\} &= w[m]\, x[m+iL] && \text{for } 0 \le m \le N-1 \\
&= w[n-iL]\, x[n] && \text{for } iL \le n \le iL+N-1,
\end{aligned}
\tag{2.11}
$$

where the second step is carried out to simplify the upcoming formulation.
Regarding the size of the DFT, when K > N the DFT is oversampled; 
this frequency-domain oversampling results in time-limited interpolation of the 
spectrum, which is analogous to the bandlimited interpolation that is charac- 
teristic of time-domain oversampling. In the undersampled case K < N, time- 
domain aliasing is introduced, so the formulation must be revised to provide 
for time-domain aliasing cancellation [178], which will be discussed in Section 
2.2.2. To avoid such difficulties, the condition K ≥ N is imposed at this point.
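
The effect of the DFT size on the recovered segment can be illustrated with a short numerical sketch. The helper `dft_K` below is a hypothetical name; it evaluates a K-point DFT of a length-N segment directly from the defining sum so that the undersampled case K < N can be examined as well.

```python
import numpy as np

def dft_K(segment, K):
    """K-point DFT of a length-N segment: sum_m s[m] exp(-j*2*pi*k*m/K), k = 0..K-1."""
    N = len(segment)
    k = np.arange(K)[:, None]
    m = np.arange(N)[None, :]
    return np.exp(-2j * np.pi * k * m / K) @ segment

segment = np.arange(1.0, 7.0)                      # N = 6 samples: 1, 2, ..., 6

# Oversampled case (K = 8 > N): the IDFT returns the segment, zero-padded to K.
recovered_over = np.fft.ifft(dft_K(segment, 8)).real

# Undersampled case (K = 4 < N): samples K apart fold onto one another.
recovered_under = np.fft.ifft(dft_K(segment, 4)).real
```

In the undersampled case the inverse transform returns the segment wrapped modulo K, so the fifth and sixth samples are added onto the first and second. This is the time-domain aliasing referred to above, and it cannot be undone by the inverse DFT alone.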

If the DFT is large enough that no aliasing occurs, reconstruction can be 
simply carried out by an overlap-add (OLA) process, possibly with a synthesis 
window [38, 172, 173], which will be denoted by v[n]: 


$$ \hat{x}[n] = \sum_i w[n-iL]\, v[n-iL]\, x[n]. \tag{2.12} $$


Perfect reconstruction is thus achieved if the windows w[n] and v[n] satisfy the
constraint


$$ \sum_i w[n-iL]\, v[n-iL] = 1 \tag{2.13} $$


or some other constant. This constraint is similar to but somewhat more general 
than the perfect reconstruction constraints given in [5, 38, 172, 173, 190]. Note 
that throughout this section the analysis and synthesis windows will both be 
assumed to be real-valued. 
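
The overlap-add reconstruction of Equation (2.12) can be sketched as an analysis-synthesis round trip. The code below assumes K >= N, a real-valued signal, and real-valued windows; the function names are hypothetical. It does not require the windows to satisfy Equation (2.13), so that the gain applied to x[n] can be inspected directly.

```python
import numpy as np

def analyze(x, w, hop, K):
    """Subsampled STFT of Equation (2.4), one column per frame (assumes K >= N)."""
    N = len(w)
    num_frames = 1 + (len(x) - N) // hop
    frames = np.stack([w * x[i * hop:i * hop + N] for i in range(num_frames)], axis=1)
    return np.fft.fft(frames, n=K, axis=0)

def overlap_add_synthesize(X, v, hop):
    """Inverse DFT each frame, apply the synthesis window v, and overlap-add."""
    N = len(v)
    num_frames = X.shape[1]
    # For K >= N the first N samples of each IDFT are w[m] x[m + i*hop].
    segments = np.fft.ifft(X, axis=0)[:N, :].real
    y = np.zeros(hop * (num_frames - 1) + N)
    for i in range(num_frames):
        y[i * hop:i * hop + N] += v * segments[:, i]
    return y

def window_product_sum(w, v, hop, num_frames):
    """Compute sum_i w[n - i*hop] v[n - i*hop], the gain on x[n] in Equation (2.12)."""
    N = len(w)
    gain = np.zeros(hop * (num_frames - 1) + N)
    for i in range(num_frames):
        gain[i * hop:i * hop + N] += w * v
    return gain

rng = np.random.default_rng(2)
x = rng.standard_normal(200)
w, v = np.hamming(32), np.bartlett(32)
X = analyze(x, w, hop=8, K=64)
y = overlap_add_synthesize(X, v, hop=8)
gain = window_product_sum(w, v, hop=8, num_frames=X.shape[1])
```

The output `y` equals `gain * x` sample by sample, so the reconstruction is perfect exactly where the summed window product is constant, which is the content of Equation (2.13).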

In cases where v[n] is not explicitly specified, the synthesis window is equivalently
a rectangular window covering the same time span as w[n]. For a
rectangular synthesis window, the constraint in Equation (2.13) becomes


$$ \sum_i w[n-iL] = 1. \tag{2.14} $$


The construction of windows with this property has been explored in the lit- 
erature; a variety of perfect reconstruction windows have been proposed, for 
example rectangular and triangular windows and the Blackman-Harris family, 
which includes the familiar Hanning and Hamming windows [99, 163]. These
are also referred to as windows with the overlap-add property, and will be
denoted by w_PR[n] in the following derivations. Note that any window function
satisfies the condition in Equation (2.14) in the nonsubsampled case L = 1; 
note also that in the case L = N the only window that has the overlap-add 
property is a rectangular window of length N. Functions that satisfy Equation 
(2.14) are also of interest for digital communication; the Nyquist criterion for 
avoiding intersymbol interference corresponds to a frequency-domain overlap- 
add property [126]. 
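
The overlap-add property of Equation (2.14) is easy to test numerically for a given window and stride. The sketch below sums shifted copies of a window and checks whether the result is constant over the fully overlapped region; a constant other than one is accepted, in keeping with the remark following Equation (2.13). The function names and the tolerance are assumptions made for illustration.

```python
import numpy as np

def overlap_add_sum(w, hop):
    """Return sum_i w[n - i*hop] over a region where every overlapping shift is present."""
    N = len(w)
    num_frames = 2 * int(np.ceil(N / hop)) + 1
    total = np.zeros(hop * (num_frames - 1) + N)
    for i in range(num_frames):
        total[i * hop:i * hop + N] += w
    # Drop the start-up and tail regions, where some shifted copies are missing.
    return total[N - 1:hop * (num_frames - 1) + 1]

def has_overlap_add_property(w, hop, tol=1e-9):
    """True if the shifted copies of w sum to a constant (not necessarily one)."""
    s = overlap_add_sum(w, hop)
    return bool(np.max(s) - np.min(s) < tol)

N = 64
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)    # periodic Hann window
arbitrary = np.random.default_rng(3).random(N)

hann_half_overlap = has_overlap_add_property(hann, N // 2)
arbitrary_unit_stride = has_overlap_add_property(arbitrary, 1)
rectangular_no_overlap = has_overlap_add_property(np.ones(N), N)
hann_no_overlap = has_overlap_add_property(hann, N)
```

The first three checks succeed and the last fails, matching the observations above: any window qualifies when L = 1, while for L = N only the rectangular window does.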

Windows that satisfy Equation (2.13) can be designed in a number of ways. 
The methods to be discussed rely on using familiar windows that satisfy (2.14) 
to jointly construct analysis and synthesis windows which satisfy (2.13); various
analysis-synthesis window pairs designed in this way exhibit computational
and modeling advantages [38, 172, 173, 201]. In one design approach, complementary
powers of a perfect reconstruction window provide the analysis and
synthesis windows:


$$ \sum_i w_{PR}[n-iL] = 1 \;\Longrightarrow\; \sum_i \left(w_{PR}[n-iL]\right)^{c} \left(w_{PR}[n-iL]\right)^{1-c} = 1 \tag{2.15} $$

$$ \Longrightarrow \quad \left\{ \begin{array}{ll} \text{Analysis window} & w[n] = \left(w_{PR}[n]\right)^{c} \\ \text{Synthesis window} & v[n] = \left(w_{PR}[n]\right)^{1-c}. \end{array} \right. \tag{2.16} $$


The case c = 1/2, where the analysis and synthesis windows are equivalent, has
been of some interest because of its symmetry. A second approach is as follows:
given a perfect reconstruction window w_PR[n] and an arbitrary window b[n] that
is strictly nonzero over the time support of w_PR[n], the overlap-add property
can be rephrased as follows:


$$ \sum_i w_{PR}[n-iL] = 1 \;\Longrightarrow\; \sum_i w_{PR}[n-iL] \left( \frac{b[n-iL]}{b[n-iL]} \right) = 1 \tag{2.17} $$

$$ \Longrightarrow\; \sum_i b[n-iL] \left( \frac{w_{PR}[n-iL]}{b[n-iL]} \right) = 1 \tag{2.18} $$

$$ \Longrightarrow \quad \left\{ \begin{array}{ll} \text{Analysis window} & w[n] = b[n] \\ \text{Synthesis window} & v[n] = \dfrac{w_{PR}[n]}{b[n]}. \end{array} \right. \tag{2.19} $$


Noting the form of the synthesis window in the overlap-add sum in Equation
(2.18), the restriction that b[n] be strictly nonzero can be relaxed slightly: b[n]
can be zero where w_PR[n] is also zero; if the synthesis window v[n] is defined
to be zero at those points, the perfect reconstruction condition is met. This
latter design method will come into play in the frequency-domain sinusoidal 
synthesizer to be discussed in Section 2.5. 
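
Both window design methods can be sketched in a few lines. The code below starts from a periodic Hann window, which has the overlap-add property at a stride of half its length, and builds analysis-synthesis pairs by complementary powers and by division by an arbitrary window b[n]. The function names, the restriction 0 <= c <= 1, and the choice of a Hamming window for b[n] are assumptions made for illustration.

```python
import numpy as np

def periodic_hann(N):
    """Periodic Hann window; its copies shifted by N/2 sum to one."""
    n = np.arange(N)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / N)

def complementary_power_pair(w_pr, c):
    """Equations (2.15)-(2.16): w = w_pr**c and v = w_pr**(1 - c), for 0 <= c <= 1."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("this sketch assumes 0 <= c <= 1")
    return w_pr ** c, w_pr ** (1.0 - c)

def quotient_pair(w_pr, b):
    """Equations (2.17)-(2.19): w = b and v = w_pr / b.

    v is set to zero wherever b is zero; perfect reconstruction then requires
    w_pr to be zero at those points as well.
    """
    v = np.zeros_like(w_pr)
    nonzero = b != 0
    v[nonzero] = w_pr[nonzero] / b[nonzero]
    return b, v

N, hop = 64, 32
w_pr = periodic_hann(N)

w_sqrt, v_sqrt = complementary_power_pair(w_pr, 0.5)     # identical windows
w_quot, v_quot = quotient_pair(w_pr, np.hamming(N))      # arbitrary analysis window

# With hop = N/2, exactly two shifted copies overlap at every interior sample.
sum_sqrt = (w_sqrt * v_sqrt)[:hop] + (w_sqrt * v_sqrt)[hop:]
sum_quot = (w_quot * v_quot)[:hop] + (w_quot * v_quot)[hop:]
```

Both `sum_sqrt` and `sum_quot` equal one at every sample, so each pair satisfies Equation (2.13). The quotient design leaves the analysis window free, which allows it to be chosen for its spectral properties while the synthesis window absorbs the correction.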


The STFT as a heterodyne filter bank. In [5, 172, 173, 232], where the 
STFT is defined as in Equation (2.5) and the expansion functions have an 
absolute time reference, the transform can be interpreted as a filter bank with 
a heterodyne structure. Starting with Equation (2.5),

$$\tilde{X}[k,n] = \sum_{m=n}^{n+N-1} \tilde{w}[n-m]\, x[m]\, e^{-j\omega_k m}, \tag{2.20}$$

the substitution

$$x_k[m] = x[m]\, e^{-j\omega_k m} \tag{2.21}$$

yields an expression that is immediately recognizable as a convolution:

$$\tilde{X}[k,n] = \sum_{m=n}^{n+N-1} \tilde{w}[n-m]\, x_k[m]. \tag{2.22}$$

The filter $\tilde{w}[n]$ is typically lowpass; it thus extracts the baseband spectrum of
$x_k[m]$. According to the modulation relationship defined in Equation (2.21),
$x_k[m]$ is a version of x[m] that has been modulated down by $\omega_k$; thus, the
baseband spectrum of $x_k[m]$ corresponds to the spectrum of x[m] in the
neighborhood of frequency $\omega_k$. In this way, the k-th branch of the STFT filter
bank extracts information about the signal in a frequency band around
$\omega_k = 2\pi k/K$. In the time domain, $\tilde{X}[k,n]$ can be interpreted as the amplitude
envelope of a sinusoid with frequency $\omega_k$. This perspective leads to the framework for
signal reconstruction based on the filter bank interpretation of the STFT; this 
framework is known as the filter bank summation (FBS) method. The idea is 
straightforward: the signal can be reconstructed by modulating each of these 
envelopes to the appropriate frequency and summing the resulting signals. This 
construction is given by

$$\hat{x}[n] = \sum_k \tilde{X}[k,n]\, e^{j\omega_k n}, \tag{2.23}$$

which can be manipulated to yield perfect reconstruction conditions [5, 190];
this nonsubsampled case is not very general, however, so these constraints will
not be derived here. Rather, Equation (2.23) is given to indicate the similarity
of the STFT signal model and the sinusoidal model. Each of the components
in the sum of Equation (2.23) can be likened to a partial; the function
$\tilde{X}[k,n]$ is then the time-varying amplitude of the k-th partial. Note that in
the phase-localized STFT formulated in Equation (2.3), the corresponding
reconstruction formula is

$$\hat{x}[n] = \sum_k X[k,n], \tag{2.24}$$

where the STFT X[k,n] corresponds to a partial at frequency $\omega_k$ rather than
its amplitude envelope.

Figure 2.2 depicts one branch of a heterodyne STFT filter bank and provides 
an equivalent structure based on modulated filters [232]. Mathematically, the 
equivalence is straightforward:

$$\tilde{X}[k,n] = \sum_m \tilde{w}[n-m] \left( x[m]\, e^{-j\omega_k m} \right)$$

$$= e^{-j\omega_k n} \sum_m \left( \tilde{w}[n-m]\, e^{j\omega_k (n-m)} \right) x[m] = e^{-j\omega_k n}\, X[k,n]. \tag{2.25}$$

Given the relationship in Equation (2.9), namely that
$\tilde{X}[k,n] = X[k,n]\, e^{-j\omega_k n}$, it is clear that X[k,n] is the immediate output
of the modulated filter $\tilde{w}[n]\, e^{j\omega_k n}$ without the ensuing modulation to
baseband. This observation, which is indicated in Figure 2.2, serves as
motivation for interpreting the STFT of Equation (2.3) as a modulated filter bank.

Figure 2.2. One channel of a heterodyne filter bank for evaluating the STFT $\tilde{X}[k,n]$
defined in Equation (2.5). The two structures are equivalent as indicated in Equation
(2.25). The STFT X[k,n] as defined in Equation (2.3) is an intermediate signal in the
second structure.
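The equivalence of the two structures can be confirmed numerically. The sketch below is illustrative only: it assumes that Equation (2.3) has the form X[k,n] = sum over m of w[m] x[n+m] exp(-j ω_k m), that the heterodyne window is the time reversal of w[n], and that ω_k = 2πk/K; the function names are hypothetical.

```python
import numpy as np

def stft_phase_localized(x, w, k, n, K):
    """X[k,n] = sum_m w[m] x[n+m] exp(-j w_k m); time origin at the window start."""
    m = np.arange(len(w))
    return np.sum(w * x[n + m] * np.exp(-2j * np.pi * k * m / K))

def stft_heterodyne(x, w, k, n, K):
    """Heterodyne form: modulate x down by w_k, then apply the lowpass window."""
    x_k = x * np.exp(-2j * np.pi * k * np.arange(len(x)) / K)   # Eq. (2.21)
    m = np.arange(len(w))
    # With the time-reversed window, w~[n - m'] = w[m' - n] for m' = n + m.
    return np.sum(w * x_k[n + m])

rng = np.random.default_rng(0)
N, K = 16, 16                      # assumed window length and number of channels
w = np.hanning(N + 2)[1:-1]        # any window serves for this check
x = rng.standard_normal(200)

k, n = 3, 40
lhs = stft_heterodyne(x, w, k, n, K)
rhs = np.exp(-2j * np.pi * k * n / K) * stft_phase_localized(x, w, k, n, K)
print(np.allclose(lhs, rhs))
```

The two outputs differ only by the modulation factor exp(-j ω_k n), so their magnitudes agree while their phases do not.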


The STFT as a modulated filter bank. Modulated filter banks, in which 
the filters are modulated versions of a prototype lowpass filter, have been of 
considerable interest in the recent literature [141, 232, 238]. In part, this in- 
terest has stemmed from the realization that the STFT can be implemented 
with a modulated filter bank structure. Indeed, the STFT of Equation (2.3) 
corresponds exactly to a modulated filter bank of the general form shown in 
Figure 2.3. This filter bank is markedly different from the heterodyne structure 
in that the subband signals are not amplitude envelopes but are actual signal 
components that can be likened to partials, which will prove conceptually useful 
in extending the STFT to the general sinusoidal model. 

The modulated filter bank of Figure 2.3 implements an STFT
analysis-synthesis if the filters are defined as

$$h_k[n] = w[-n]\, e^{j\omega_k n} \tag{2.26}$$

$$g_k[n] = v[n]\, e^{j\omega_k n}. \tag{2.27}$$

Figure 2.3. Interpretation of the short-time Fourier transform as a modulated filter bank,
with analysis filters $h_k[n] = w[-n]\, e^{j\omega_k n}$ and synthesis filters $g_k[n] = v[n]\, e^{j\omega_k n}$.
The subband signals are labeled to match the formulation in the text.

Note the time-reversal of the window w[n] in the definition of the analysis filter
$h_k[n]$; the time-reversal appears because the window in Equation (2.3) is not
thought of in a time-reversed fashion as in Equation (2.5). Using the notation
in Figure 2.3, the subband signals in the STFT filter bank are given by

$$x_k[n] = \sum_m h_k[m]\, x[n-m] \tag{2.28}$$

$$= \sum_m w[-m]\, x[n-m]\, e^{j\omega_k m} \tag{2.29}$$

$$= \sum_m w[m]\, x[n+m]\, e^{-j\omega_k m} \tag{2.30}$$

$$= X[k,n] \tag{2.31}$$

$$y_k[i] = x_k[iL] \tag{2.32}$$

$$= X(k,i) \tag{2.33}$$

$$z_k[n] = x_k[n] \sum_i \delta[n - iL], \tag{2.34}$$


where the last expression simply describes the effect of successive downsampling
and upsampling on the signal $x_k[n]$. Again, note that the subband signals are
essentially the partials of the signal model, and are not amplitude envelopes as 
in the heterodyne structure of the traditional STFT filter bank. 

In the framework of [5], namely the STFT as given in Equation (2.5), the 
overlap-add and filter bank summation synthesis methods lead to different
perfect reconstruction constraints which can be interpreted as duals. For the
phase-localized definition of the STFT, on the other hand, the OLA and FBS
methods lead directly to the same constraint:

$$\hat{x}[n] = \sum_{k=0}^{K-1} \hat{x}_k[n] \tag{2.35}$$

$$= \sum_k \left( z_k[n] * g_k[n] \right) \tag{2.36}$$

$$= \sum_k \sum_l g_k[l]\, z_k[n-l] \tag{2.37}$$

$$= \sum_k \sum_l v[l]\, e^{j\omega_k l}\, x_k[n-l] \sum_i \delta[n-l-iL] \tag{2.38}$$

$$= \sum_k \sum_l \sum_i \sum_m v[l]\, w[m]\, x[n-l+m]\, e^{j\omega_k (l-m)}\, \delta[n-l-iL]. \tag{2.39}$$


For $\omega_k = 2\pi k/K$, the summation over the frequency index k can be expressed
as

$$\sum_{k=0}^{K-1} e^{j\omega_k (l-m)} = K \sum_r \delta[l - m + rK]. \tag{2.40}$$

If |l - m| < K for all possible combinations of l and m, then the only relevant
term in the right-hand sum is for r = 0, in which case the equation simplifies
to

$$\sum_{k=0}^{K-1} e^{j\omega_k (l-m)} = K\, \delta[l - m]. \tag{2.41}$$

The restriction on the values of l and m corresponds to the constraint $K \ge N$
discussed in the treatment of overlap-add synthesis; namely, time-domain
aliasing is introduced if l and m do not meet this criterion. Further consideration
of time-domain aliasing is deferred until Section 2.2.2.
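The identity in Equation (2.40) is easy to confirm by direct summation. The sketch below is only a numerical check under the assumption that ω_k = 2πk/K; the function name `modulation_sum` and the value K = 8 are hypothetical.

```python
import numpy as np

def modulation_sum(d, K):
    """sum_{k=0}^{K-1} exp(j * 2*pi*k/K * d) for an integer lag d = l - m."""
    k = np.arange(K)
    return np.sum(np.exp(2j * np.pi * k * d / K))

K = 8
for d in range(-2 * K, 2 * K + 1):
    expected = K if d % K == 0 else 0      # K * sum_r delta[d + rK]
    assert abs(modulation_sum(d, K) - expected) < 1e-9
```

Lags that are nonzero multiples of K also produce the value K; these are the terms that introduce time-domain aliasing when the window is longer than K.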

As in the discussion of OLA synthesis, it is assumed at this point that time- 
domain aliasing is not introduced. Then, the FBS reconstruction formula can 
be rewritten as

$$\hat{x}[n] = K \sum_i \sum_l v[l]\, \delta[n-l-iL] \sum_m w[m]\, x[n-l+m]\, \delta[l-m] \tag{2.42}$$

$$= K\, x[n] \sum_l \sum_i w[l]\, v[l]\, \delta[n-l-iL] \tag{2.43}$$

$$= K\, x[n] \sum_i w[n-iL]\, v[n-iL]. \tag{2.44}$$

The design constraint for perfect reconstruction, within a gain term, is then
exactly the same as in the OLA synthesis approach:

$$\sum_i w[n-iL]\, v[n-iL] = 1. \tag{2.45}$$

Because of this equivalence, the analysis-synthesis window pairs described
earlier can be used as prototype functions for perfect reconstruction modulated
filter banks.
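A short numerical experiment illustrates the constraint of Equation (2.45). This is a sketch under stated assumptions rather than the implementation used in the text: the STFT frames are computed with an FFT, the synthesis is carried out in overlap-add form, the windows are square-root periodic Hann windows with a hop of N/2, and the function names are hypothetical. The 1/K factor built into `np.fft.ifft` absorbs the gain K of Equation (2.44).

```python
import numpy as np

def stft_analysis(x, w, L, K):
    """Frames X(k,i) = sum_m w[m] x[iL+m] exp(-j 2 pi k m / K); one row per frame i."""
    N = len(w)
    return np.array([np.fft.fft(w * x[s:s + N], K)
                     for s in range(0, len(x) - N + 1, L)])

def stft_synthesis(X, v, L, length):
    """Inverse DFT each frame, apply the synthesis window, and overlap-add."""
    N = len(v)
    y = np.zeros(length)
    for i, frame in enumerate(X):
        seg = np.fft.ifft(frame)[:N].real    # zero-padded frame; keep the first N samples
        y[i * L:i * L + N] += v * seg
    return y

rng = np.random.default_rng(1)
N, L, K = 64, 32, 128                        # assumed sizes with K >= N
n = np.arange(N)
w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * n / N))   # square-root periodic Hann
v = w.copy()
x = rng.standard_normal(512)

y = stft_synthesis(stft_analysis(x, w, L, K), v, L, len(x))
interior = slice(N - L, len(x) - (N - L))    # samples covered by a full set of windows
print(np.max(np.abs(y[interior] - x[interior])))
```

Only the interior samples are compared because the first and last N - L samples are covered by a single window and so do not see the full overlap-add sum.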

Note that if L > 1, the synthesis filter bank interpolates the subband signals.
In the nonsubsampled case L = 1, when no interpolation is needed, perfect
reconstruction can be achieved with any analysis-synthesis window pair for
which $\sum_n w[n]\, v[n] \neq 0$. For example, the synthesis can be performed with the
trivial filter bank $g_k[n] = \delta[n]$ if the analysis window satisfies the constraint

$$\sum_i w[n-i] = 1, \tag{2.46}$$


which indeed holds for any window, within a gain term. The generality of 
this constraint is an example of the design flexibility that results from using 
oversampled or overcomplete approaches [24, 39, 40]. 

At this point it is useful to recall the earlier formulation of the STFT signal 
model:

$$\hat{x}[n] = \sum_k X[k,n] = \sum_k \tilde{X}[k,n]\, e^{j\omega_k n}. \tag{2.47}$$

In the modulated filter bank case, the subband signals can be viewed as the
partials of the sinusoidal model; in the heterodyne case, the subband signals are
instead lowpass amplitude envelopes of the partials. Furthermore, the phase
of X[k,n] is the phase of the k-th partial whereas the phase of $\tilde{X}[k,n]$ is the
phase of the envelope of the k-th partial; the former phase measurement is
needed for the sinusoidal model. In the next section, it will be shown that rigid 
association of the subband signals to partials is basically inappropriate for either 
case; the modulated STFT analysis filter bank, however, more readily provides 
the information necessary to derive a generalized sinusoidal signal model. 


2.2.2. Limitations of the STFT and Parametric Extensions 


The interpretation of the STFT as a modulated filter bank leads to a variety of 
modeling implications. These issues in some sense revolve around the nonpara- 
metric representation of the signal in terms of subbands and the use of a rigid 
filter bank for synthesis. This section deals with the limitations of the STFT; 
the considerations motivate parametric extensions of the STFT that overcome 
some of these limitations. 


Partial tracking. The most immediate limitation of the short-time Fourier 
transform results from its fixed structure. A sinusoid with time-varying fre- 
quency will move across bands; this evolution leads to delocalization of the 
representation and a noncompact model. Consider the example shown in Fig- 
ure 2.4, in which a sinusoid of linearly increasing frequency, i.e. a linear chirp, 
is modeled by a nonsubsampled STFT filter bank where the analysis and
synthesis filter prototypes are both square-root Hanning windows (c = 1/2). The
parameters of the STFT are K = 128, L = 1, and N = 64; the chirp frequency
starts at $\omega_0 = 2\pi/K$ and increases by that amount every 250 samples.

Figure 2.4. Reconstructed subband signals in a nonsubsampled STFT filter bank model
of a chirp signal. The signals $\hat{x}_k[n]$ correspond to those labeled in Figure 2.3 for
k = {1,2,3,4}. In the simulation, N = 128, K = 128, and L = 1; w[n] and v[n] are
square-root Hanning windows.

Figure 2.4 shows the real parts of the reconstructed subband signals for 
bands k = 1,2,3,4. It is necessary to consider the real parts for the following 
reason: the subband signals in the STFT are complex-valued as a result of 
the complex modulation of the filters. For real signals, the STFT yields a 
conjugate symmetric representation like the underlying DFT; each of these 
subband signals has a conjugate version. This observation motivates cosine- 
modulated filter banks where the prototype filters are modulated with a real 
cosine instead of a complex sinusoid. Then, the subband signals are real-valued, 
which is certainly desirable in some cases; here, however, it is problematic 
since the phase provided by the complex filter bank is important for sinusoidal 
modeling as will be seen. While cosine-modulated filter banks have interesting 
and significant properties [141, 142, 178, 232, 238], they are an offshoot of 
the progression of ideas that leads to the sinusoidal model and will not be 
considered in depth here because of this phase problem. 

Returning to the example of Figure 2.4, it is clear that the subbands of the
fixed filter bank do not provide a compact representation of the chirp signal.
As the chirp evolves in time, it moves across the bands of the filter bank, and
as a result the STFT does not identify this as a single evolving sinusoid but
instead as a conglomeration of short-lived components, i.e. the subband signals
shown in Figure 2.4. Whereas this may seem useful in that it carries out a
granulation of the chirp signal, inspection of the signal components shows that
the subband grains are not well-localized in time; note that the transients in
the original signal are manifested in all of the subband signals as pre-echoes.
Figure 2.5 shows the model of the same chirp signal using a subsampled STFT
filter bank with L = 64. This example is perhaps more practical than the
nonsubsampled case in that there is much less data in the representation, but
this practicality comes at the cost of more substantial localization problems in
the subbands. Perfect reconstruction can be achieved in this case; the various
artifacts cancel in the synthesis. The presence of these artifacts, however,
renders the signal decomposition problematic for modifications; if the subbands
are modified, e.g. quantized, the subband artifacts will not be properly cancelled
and artifacts will appear in the final synthesis.

Figure 2.5. Reconstructed subband signals in a subsampled STFT filter bank model
of a chirp signal. The signals $\hat{x}_k[n]$ correspond to those labeled in Figure 2.3 for
k = {1,2,3,4}. In the simulation, N = 128, K = 128, and L = 64; w[n] and
v[n] are square-root Hanning windows.
[Plot: four panels showing the real parts of the subband signals for k = 1 to 4,
amplitude axis from -0.4 to 0.4, versus time in samples from 0 to 900.]

In pseudo-periodic musical signals, the frequencies of the harmonics vary 
as the pitch evolves in time; in such cases, it is intuitively desirable that the 
sum-of-partials model should be an aggregation of chirps whose frequencies are 
coupled while changing in time in a complex way. For such signals, unlike
in the single chirp case, all of the STFT filter bank subbands will generally
have significant energy throughout the duration of the signal, so inspection of 
the subbands will not necessarily indicate that the various partials are moving 
across the bands. When all of the subbands have significant energy, it may seem 
reasonable to interpret the subbands as the partials of the sinusoidal model as 
has been discussed; this perspective, however, is in contention with the physical 
foundation of the natural signal. The generating mechanism for a signal whose 
harmonic structure varies in time is a system with a physical parameter, such 
as a string length, that is correspondingly time-varying, and a meaningful rep- 
resentation should capture this foundation. Rather than imposing structure on 
the partials by defining them to exist within subbands as in the STFT model, 
the partials should be based on tracking the time-frequency evolution of the 
signal. As will be seen, this tracking effort is what makes the sinusoidal model 
fundamentally signal-adaptive. 

One approach to the problem of partial tracking in an STFT filter bank is 
to make the filter bank pitch-adaptive so that the subbands do correspond 
to physically reasonable partials; in that method, which was considered in a 
preliminary fashion in [48], signal adaptivity improves the model. A pitch- 
adaptive filter bank, however, does not account for the more general case of 
signals composed of inharmonic partials with unrelated frequency evolution 
behavior, for instance a percussive sound such as a cymbal clash; modeling 
such an arbitrary signal requires a more flexible approach. 


Time-domain aliasing cancellation. Time-domain aliasing was mentioned 
in the discussions of both the overlap-add and the filter bank summation syn- 
thesis methods; in those treatments, it was assumed that K was large enough 
that time-domain aliasing was not introduced. In this section, the issue of 
time-domain aliasing is explored; the treatment leads to general perfect recon- 
struction constraints for modulated filter banks and various implications for 
signal modeling. This issue is discussed here more for the sake of complete- 
ness than as a prerequisite for the development of the general sinusoidal model. 
Essentially, time-domain aliasing cancellation is a fix that allows for perfect 
reconstruction despite a lack in frequency resolution; with this in mind, the 
importance of frequency resolution in sinusoidal modeling implies that STFT 
filter banks that incorporate time-domain aliasing cancellation will not be of 
interest in future considerations. 

For a signal a[n] of length N on [0, N - 1], application of a size K DFT
followed by a size K IDFT corresponds to

$$\hat{a}[n] = \frac{1}{K} \sum_{k=0}^{K-1} \left( \sum_{m=0}^{N-1} a[m]\, e^{-j2\pi km/K} \right) e^{j2\pi kn/K} \tag{2.48}$$

$$= \frac{1}{K} \sum_{m=0}^{N-1} a[m] \sum_{k=0}^{K-1} e^{j2\pi k(n-m)/K}. \tag{2.49}$$

Using the simplification for the sum over k given in Equation (2.40) yields

$$\hat{a}[n] = \sum_{m=0}^{N-1} a[m] \sum_{r=-\infty}^{\infty} \delta[n-m+rK] \tag{2.50}$$

$$= \sum_r a[n+rK], \tag{2.51}$$

where the r values in the sum of the last expression correspond to values of
n + rK that fall within the span of the signal, namely

$$0 \le n + rK \le N-1 \quad \text{for} \quad 0 \le n \le N-1. \tag{2.52}$$


This formulation explains the condition on K imposed in the earlier
treatments; if $K \ge N$, time-domain aliasing is not introduced because only the
r = 0 term contributes to the reconstruction. On the other hand, if K < N,
the signal is aliased in the time domain. Fundamentally, this aliasing is a
result of insufficient spectral sampling of the continuous function $A(e^{j\omega})$, the
discrete-time Fourier transform (DTFT) of a[n], and is thus analogous to the
frequency-domain aliasing that occurs when a continuous time-domain signal is 
sampled below the Nyquist rate. The DTFT, the DFT, and spectral sampling 
are discussed further in Section 2.5.1. 
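The aliasing relationship of Equation (2.51) can be reproduced with a few lines of code. The following sketch is illustrative: the function names and the ten-sample test signal are hypothetical. It samples the DTFT of a length-N signal at K frequencies, inverts with a size-K IDFT, and compares the result with the direct sum of shifted copies.

```python
import numpy as np

def dft_idft(a, K):
    """Sample the DTFT of a[n] at K frequencies, then invert with a size-K IDFT."""
    N = len(a)
    k = np.arange(K)[:, None]
    m = np.arange(N)[None, :]
    A = np.sum(a[None, :] * np.exp(-2j * np.pi * k * m / K), axis=1)
    return np.fft.ifft(A)            # length K; ifft supplies the 1/K factor

def time_aliased(a, K):
    """Direct evaluation of sum_r a[n + rK] for n = 0..K-1."""
    N = len(a)
    padded = np.zeros(int(np.ceil(N / K)) * K)
    padded[:N] = a
    return padded.reshape(-1, K).sum(axis=0)

a = np.arange(1.0, 11.0)             # N = 10 samples: 1, 2, ..., 10

# K < N: the tail of the signal folds back onto the first K samples.
print(time_aliased(a, 4))            # [15. 18. 10. 12.]
print(np.allclose(dft_idft(a, 4), time_aliased(a, 4)))

# K >= N: only the r = 0 term contributes, so the signal is recovered.
print(np.allclose(dft_idft(a, 16)[:10], a))
```

With K = 4 the first output sample is 1 + 5 + 9 = 15, which is the sum of a[0], a[4], and a[8].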

The effect of time-domain aliasing on the perfect reconstruction condition 
can be readily formalized; the following derivation uses the overlap-add syn- 
thesis framework, but the filter bank summation approach yields the same 
condition, within a gain factor of K. For the signal segment

$$a_i[n] = w[n-iL]\, x[n], \tag{2.53}$$

the reconstructed version of the segment is given by

$$\hat{a}_i[n] = \sum_r a_i[n+rK] = \sum_r w[n+rK-iL]\, x[n+rK]. \tag{2.54}$$

The OLA synthesis of the signal, with synthesis window v[n], is given by

$$\hat{x}[n] = \sum_i v[n-iL]\, \hat{a}_i[n]. \tag{2.55}$$

Substituting for $\hat{a}_i[n]$ and changing the order of the sums yields

$$\hat{x}[n] = \sum_r x[n+rK] \sum_i v[n-iL]\, w[n+rK-iL]. \tag{2.56}$$

If $\hat{x}[n] = x[n]$ is to hold, every term but r = 0 must be cancelled in the other
sum; the perfect reconstruction constraint is thus

$$\sum_i v[n-iL]\, w[n+rK-iL] = \delta[r]. \tag{2.57}$$

In the nonsubsampled case with $v[n] = \delta[n]$, this simplifies to

$$w[rK] = \delta[r], \tag{2.58}$$


which is reminiscent of the constraint for designing interpolation filters [164, 
172]. Note that since the time index is the start of the window in this treatment,
the most appropriate synthesis window is actually given by $v[n] = \delta[n - n_0]$,
where $n_0$ corresponds to the middle of the analysis window. The final constraint
on the analysis window is then $w[n_0 + rK] = \delta[r]$, which is satisfied by any
function with zeros at $n_0 + rK$ for all $r \neq 0$ and a nonzero value at $n_0$, which
can be scaled to unity for gain compensation. A useful class of windows that
meet this constraint can be constructed by multiplying a perfect reconstruction
window by an appropriate sinc function. As mentioned earlier, perfect
reconstruction windows can be virtually arbitrary in the nonsubsampled case; here,
the formulation is most appealing if the perfect reconstruction windows under
consideration are those that apply to subsampled cases. This class of aliasing
cancellation windows is given by

$$w[n] = w_{PR}[n]\, \frac{\sin\left[\pi (n - n_0)/K\right]}{\pi (n - n_0)}, \tag{2.59}$$

where the sinc function, as written, will introduce a gain of 1/K. The frequency
response of the resultant window is the spectrum of $w_{PR}[n]$ convolved with an
ideal lowpass filter with cutoff frequency $\pi/K$; this convolution relationship
implies that w[n] is a broader lowpass filter than $w_{PR}[n]$, which corroborates the
previous statement that time-domain aliasing cancellation and lack of frequency 
resolution are coupled. 
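The window of Equation (2.59) can be constructed and checked directly. The sketch below is a minimal illustration, assuming a periodic Hann window as the perfect reconstruction window and the window midpoint as the reference index; the function name and the sizes N = 64 and K = 16 are hypothetical.

```python
import numpy as np

def aliasing_cancellation_window(w_pr, K, n0):
    """w[n] = w_pr[n] * sin(pi (n - n0) / K) / (pi (n - n0)), as in Eq. (2.59)."""
    n = np.arange(len(w_pr))
    # np.sinc(t) = sin(pi t) / (pi t), so the sinc factor equals sinc((n - n0)/K) / K.
    return w_pr * np.sinc((n - n0) / K) / K

N, K = 64, 16                        # window longer than the DFT size, so aliasing occurs
n0 = N // 2                          # middle of the analysis window
w_pr = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)   # periodic Hann, w_pr[n0] = 1
w = aliasing_cancellation_window(w_pr, K, n0)

# Samples at n0 + rK for r = -2, -1, 0, 1, rescaled by K to undo the 1/K gain.
samples = K * w[n0 % K::K]
print(np.round(samples, 12))
```

After rescaling by K, the samples spaced K apart reduce to a unit impulse at the window midpoint, which is the constraint that the analysis window must take the value δ[r] at the indices n_0 + rK.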

As indicated above, the design of time-domain aliasing cancellation windows 
in the subsampled case is more restricted than in the nonsubsampled case; 
in other words, there is limited freedom in the design of subsampled STFT 
filter banks that employ time-domain aliasing cancellation. The subsampling 
limits the design possibilities since it introduces frequency-domain aliasing, the 
cancellation of which is an underlying principle in the equivalent constraints 
of Equations (2.13) and (2.45), and is indeed part of the general constraint 
given above in Equation (2.57). The critically sampled case L = K is of 
special interest since the representation and the original signal intrinsically 
contain the same amount of data. For critical sampling, however, it can be 
shown that the only FIR solutions correspond to windows with N = K nonzero 
coefficients [237, 238]. In the straightforward solution of this form, the N 
nonzero coefficients are all in the interval [0,N — 1]. Intuitively, there are 
no solutions of this form for N < K since gaps would result in the window 
overlap and various regions of the signal would simply be missed in the analysis- 
synthesis. On the other hand, the reason that there are no solutions for N > K 
is less intuitive; this result is proved in [237, 238]. In the critically sampled 
case, then, the STFT in effect implements a block transform with block size 
N; quantization then leads to discontinuities at the block boundaries, which 
results in undesirable frame rate artifacts in audio and blockiness in images.
Furthermore, pre-echo distortion occurs in the reconstruction where the original
signal has transient behavior; pre-echo is a common problem in near-perfect 
reconstruction models such as filter banks with subband quantization [26]. 


The requirement that N = K = L in the critically sampled case means
that there are no critically sampled perfect reconstruction STFT filter banks 
that employ time-domain aliasing cancellation. However, time-domain aliasing 
cancellation can be incorporated in critically sampled cosine-modulated filter 
banks; such filter banks are commonly used in audio coding [26, 178, 210, 
223, 228]. The ability to use time-domain aliasing cancellation in a cosine- 
modulated filter bank is connected to the result that the expansion functions 
in a cosine-modulated filter bank can have good time and frequency localiza- 
tion [238]. Note that the lapped orthogonal transforms (LOT) mentioned in 
Section 1.4 belong to this class of filters. In the LOT, the representation is 
critically sampled but all of the basis functions are smooth and extend beyond 
the boundaries of the signal segment or block; this overlap reduces the artifacts 
caused by quantization. Quantization effects can also be reduced by oversam- 
pling; overcomplete representations exhibit a robustness to quantization noise 
that is proportional to the redundancy of the representation [24, 39, 40, 94].


Time-frequency localization. As discussed above, the design of STFT fil- 
ter banks is very limited in the critically sampled case. The only real-valued 
prototype windows that lead to orthogonal perfect reconstruction filter banks 
are rectangular windows [238]. This result is a discrete-time equivalent of the 
Balian-Low theorem, which states that there are no continuous-time orthogo- 
nal short-time Fourier transform bases that are localized in time and frequency, 
where the localization is measured in terms of $\Delta_t$ and $\Delta_\omega$ from Equations (1.30)
and (1.31); either or both of these uncertainty widths are unbounded for or- 
thonormal STFT bases. This problem motivates the use of cosine-modulated 
filter banks, which can achieve good localization in time and frequency [238]. 

Further issues regarding time-frequency localization and filter banks are be- 
yond the scope of this book; this issue will thus not be addressed further, with 
the exception of various considerations of signal expansions, which have a fun- 
damental relationship to filter banks. The point of this discussion is simply to 
cite the result that there are some difficulties with critically sampled STFT fil- 
ter banks, and that oversampling is thus required in order for STFT filter banks 
to perform well. The use of oversampling, however, is contrary to the goal of 
data reduction. This problem is solved in the sinusoidal model by applying a 
parametric representation to the STFT to achieve compaction. 


Modification of the STFT. Various signal modifications based on the 
STFT have been discussed in the literature [4, 5, 38, 96, 173, 190, 213, 214]. In 
approaches where the modifications are based directly on the function X(k,i),
the techniques are inherently restricted to a rigid framework because the signal 
is being modeled in terms of subbands which interact in complicated ways in 
the reconstruction process. The restrictive framework is exactly this: a
modification is carried out on the subband signals and the effect of the modification
on the output signal is then formulated [5, 173]. This approach is much differ- 
ent from the desired framework of simply carrying out a particular modification 
on the original signal. 


In some approaches, modifications are based on the STFT magnitude only; 
the magnitude is first modified and then a phase that will minimize synthesis 
discontinuities is derived [96, 213, 214]. This removal of the phase essentially 
results in a parametric representation that is more flexible than the complex 
subband signals. It is important to note that this magnitude-only description 
has the same caveat as other parametric models: for the case of no modification, 
the magnitude-only description is not capable of perfect reconstruction. 


In the critically sampled case, there is a one to one correspondence between 
signals and short-time Fourier transforms; because it is a basis expansion, there 
is no ambiguity in the relationship between the domains. In the oversampled 
case, however, many different STFTs will yield the same signal. This
multiplicity is evident in the simplest case: L = 1 and $v[n] = \delta[n]$;
the analysis window w[n], which derives the STFT, is virtually unrestricted. 
Such an overcomplete representation has a higher dimension than the signal 
space, meaning that some modifications in that space may have no effect on 
the signal or may produce an otherwise unexpected result; in deriving a phase 
for the STFT magnitude for synthesis in the overcomplete case, there are thus 
consistency or validity concerns that arise [9]. 


The issues of aliasing cancellation and validity, among others, indicate the 
fundamental point: the synthesis model limits the modification capability. 
Given that the most effective modification methods for the STFT rely on 
parameterization, there is in some sense no need to use a rigid filter-based 
structure for synthesis. This observation is the fundamental motivation for 
the sinusoidal model, which relies on an STFT analysis filter bank for param- 
eter estimation, but thereafter uses a fully parametric synthesis to circumvent 
issues such as frame boundary discontinuities, consistency, and aliasing can- 
cellation. This idea is essentially three steps removed from analysis-synthesis 
with an STFT filter bank; the two intermediate approaches in the progression 
of techniques are the channel vocoder and the phase vocoder. 


The channel vocoder. The term vocoder, a contraction of voice and coder, 
was coined to describe an early speech analysis-synthesis algorithm [52]. In 
particular, the channel vocoder originated as a voice coder which represented 
a speech signal based on the characteristics of the STFT filter bank channels 
or subbands. Specifically, the speech is filtered into a large number of channels 
using an STFT analysis filter bank. Each of the subbands is modeled in terms 
of its short-time energy; with respect to the k-th channel, this provides an 
amplitude envelope $A_k[n]$ which is used to modulate a sinusoidal oscillator at
the channel center frequency $\omega_k$. The outputs of these oscillators are then
accumulated to reconstruct the signal. Note that the term “vocoder” has at
this point become a general designation for a large number of algorithms which
are by no means limited to voice coding applications.

Figure 2.6. Block diagram of the phase vocoder. The amplitude and frequency (total
phase) control functions for the K oscillators are derived from the filter bank subband
signals by the parameter estimation blocks.


The phase vocoder. The channel vocoder parameterizes the subband sig- 
nal in terms of its energy or amplitude only; the phase vocoder is an extension 
that includes the phase behavior in the model parameterization as well. In 
the literature, the term phase vocoder is generally synonymous with the short- 
time Fourier transform [158], but in many applications the approach involves 
interpreting the STFT analysis data with respect to a structure like the one 
shown in Figure 2.6, where the subband signals are parameterized in terms 
of magnitude envelopes and functions that describe the frequency and phase 
evolution. These functions serve as inputs to a bank of oscillators that re- 
construct the signal from the parametric model [49, 63, 156, 172, 179]. If the 
analysis filter bank is subsampled, the sample-rate oscillator control functions 
are derived from the subsampled frame-rate STFT representation. This phase 
vocoder structure has been widely applied to modification of speech signals; 
the success of such approaches substantiates the contention that modification 
capabilities can be improved by using a parametric model and a parametric 
synthesis. 


General sinusoidal models. The phase vocoder as depicted in Figure 2.6 
does not solve the partial tracking problem discussed earlier; while its para- 
metric nature does enable modifications, it is still of limited use for modeling 
evolving signals. A further generalization leads to the sinusoidal model. The
fundamental observation in the development of the sinusoidal model is that if
the signal consists of one nonstationary sinusoid such as a chirp, then synthesis
can be achieved with one oscillator. There is no need to implement an
oscillator for every branch of the analysis filter bank. Instead, the outputs of the
analysis bank can be examined across frequency for peaks, which correspond to
sinusoids in the signal. These spectral peaks can then be tracked from frame to
frame as the signal evolves, and only one oscillator per tracked peak is required
for synthesis. This structure is depicted in Figure 2.7. Note that because the
synthesis is based on a parametric model, the prototype window for the analysis
filter bank does not have to satisfy an overlap-add property.

Figure 2.7. Block diagram of the general sinusoidal model. The amplitude and frequency
(total phase) control functions are derived from the filter bank outputs by tracking spectral
peaks in time as they move from band to band for an evolving signal. The parameter
estimation block detects and tracks spectral peaks; unless Q is externally constrained, the
number of peaks detected dictates the number of oscillators used for synthesis.

For the chirp signal used in Figures 2.4 and 2.5, a sinusoidal model with 
one oscillator yields the reconstruction shown in Figure 2.8(b). The model 
data for the reconstruction in Figure 2.8(b) is extracted from the same STFT 
produced by the subsampled analysis filter bank of the Figure 2.5 example. 
With respect to data reduction, the one-partial sinusoidal model in Figure 2.8 
is basically characterized by three real numbers $\{A, \omega, \phi\}$ for each signal frame.
For real signals, the STFT filter bank model consists of K/2 complex numbers 
for each frame, so the compression achieved is significant; this is of course less 
drastic for complicated signals with many partials. Note that the compression 
is accompanied by an inability to carry out perfect reconstruction. A primary 
reconstruction inaccuracy or artifact in the sinusoidal model is pre-echo, which 
is evident in Figure 2.8. This problem is discussed further in Section 2.6; 
Figure 2.8. One-component sinusoidal model of the chirp signal from Figure 2.5 using 
the same analysis filter bank as in that example. 


in Chapter 3, methods for alleviating the pre-echo distortion are developed. 
Note also that the sinusoidal model provides a better description of the signal 
behavior than the filter bank decomposition; this example illustrates how a 
compact parametric model is useful for analysis. 

In the general sinusoidal model, there are no strict limitations on N, K, and 
L for the analysis filter bank. Typically, K > N, meaning that oversampling 
in frequency is used, which in some cases yields a more accurate model than 
critical sampling (K = N) as will be seen in the next section. Note that 
an increase in K corresponds to adding more channels to the filter bank and 
decreasing the frequency spacing between channels; because each filter is simply 
a modulated version of the prototype window, however, the resolution of the 
individual channel filters is not affected by a change in K. Also, it is common 
to use a hop size of L = N/2 to achieve data reduction. As in the filter bank 
case, gaps result in the analysis if L > N, but in this method such gaps can be 
filled in the reconstruction via parameter interpolation. 


2.3 SINUSOIDAL ANALYSIS 


The analysis for the sinusoidal model is responsible for deriving a set of time- 
varying model parameters, namely the number of partials Q[n], which may be 
constrained by rate or synthesis computation limits [69], and the partial 
amplitudes {$A_q[n]$} and total phases {$\Theta_q[n]$}. As mentioned, these parameters are 
assumed to be slowly varying with respect to the sample rate, so the estimation 
process can be reliably carried out at a subsampled rate. In [149, 208], this 
analysis is done using a short-time Fourier transform followed by spectral peak 
picking; this procedure was conceptually motivated in the preceding discussion 
of the STFT. The following sections examine this analysis method; alternative 
approaches are also discussed. 


2.3.1 Spectral Peak Picking 


The analysis for the sinusoidal model is similar to many scenarios in which 
the sinusoidal content of a signal is of interest. Approaches based on Fourier 
transforms have been traditionally applied to these problems. In such methods, 
the signal is transformed into the Fourier domain and the peaks in the spec- 
tral representation are interpreted as sinusoids. In this section, the use of the 
discrete Fourier transform in this framework is considered; various resolution 
limits are demonstrated. The relationship of the discrete-frequency DFT to 
the continuous DTFT underlies some of the issues here; a discussion of this 
relationship, however, is deferred to Section 2.5.1. 


A single sinusoid. The case of identifying a single time-limited complex 
sinusoid is of preliminary importance for these considerations. For the signal 


$$x[n] = a_0\, e^{j\omega_0 n} \tag{2.60}$$


defined on the interval $n \in [0, N-1]$, where $a_0$ is a complex number that entails 
the magnitude and phase of the sinusoid, a DFT of size N is given by 


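$$X_N[k] = a_0\, e^{j\omega_0\left(\frac{N-1}{2}\right)}\, e^{-j\pi k\left(\frac{N-1}{N}\right)}\, \frac{\sin\left(N\left[\frac{\pi k}{N} - \frac{\omega_0}{2}\right]\right)}{\sin\left(\frac{\pi k}{N} - \frac{\omega_0}{2}\right)} \tag{2.61}$$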


where the subscript N denotes the DFT size. This treatment will focus on the 
estimation of sinusoids based on peaks in the magnitude of the DFT, so the 
ratio of sines in the above expression is of more importance than the preceding 
linear phase term. If the frequency of the sinusoid can be expressed as 


$$\omega_0 = \frac{2\pi k_0}{N}, \tag{2.62}$$


namely if it is equal to a bin frequency of the DFT, the numerator in this ratio 
is zero-valued for all k, meaning that the DFT itself is zero-valued everywhere 
except at $k = k_0$, where the denominator of the ratio is zero. For $k = k_0$, the 
ratio takes on a value N by L'Hôpital's rule, so the DFT magnitude is $N|a_0|$; 
the phase at $k = k_0$ is given simply by $\arg a_0$. Thus, when $\omega_0$ corresponds to 
a bin frequency, the sinusoid can be perfectly identified as a peak in the DFT 
magnitude spectrum, and its magnitude and phase can be extracted from the 
DFT. For sinusoids at other frequencies, however, the N-point DFT has a less 
simple structure. In this case, the signal is indeed represented exactly because 
the DFT is a basis expansion; however, in terms of spectral peak picking it 
is erroneous to interpret the peak in such a DFT as a sinusoid in the signal. 
These cases are depicted in Figures 2.9(a) and 2.9(b), respectively. 


Oversampling and frequency resolution. For the case of the off-bin fre- 
quency illustrated in Figure 2.9(b), the sinusoid cannot be immediately iden- 
tified in the DFT spectrum, and the DFT representation of the signal is not 
compact. The parameters of the sinusoid can, however, be estimated by in- 
terpolation. Using an oversampled DFT is one such approach. A DFT of size 
K > N is given by 


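$$X_K[k] = a_0\, e^{j\omega_0\left(\frac{N-1}{2}\right)}\, e^{-j\pi k\left(\frac{N-1}{K}\right)}\, \frac{\sin\left(N\left[\frac{\pi k}{K} - \frac{\omega_0}{2}\right]\right)}{\sin\left(\frac{\pi k}{K} - \frac{\omega_0}{2}\right)} \tag{2.63}$$
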
Figure 2.9. Estimation of a single sinusoid with the DFT. In (a), the sinusoid is at the bin 
frequency $2\pi k_0/N$ for N = 16 and $k_0 = 3$, so an N-point DFT identifies the sinusoid 
exactly. In (b), the frequency is $2\pi(k_0 + 0.4)/N$ as indicated by the asterisk in the plot; 
the sinusoid is not identified by the DFT, and the DFT representation of the signal is not 
compact. In (c), an oversampled DFT of size K = 5N is used; here the sinusoid from (b) 
can be identified exactly since $\omega_0 = 2\pi(k_0 + 0.4)/N = 2\pi(5k_0 + 2)/K = 2\pi\kappa_0/K$. 
In (d), a Hanning window is applied to the signal before the oversampled DFT is carried out. 
In this figure and in Figure 2.10, filled circles indicate when perfect estimation is achieved; 
in cases where the estimation is imperfect, the actual signal components are depicted by 
asterisks. 


In this oversampled case, a sinusoid of frequency 


$$\omega_0 = \frac{2\pi\kappa_0}{K} \tag{2.64}$$


can be identified exactly as a peak in the spectrum as shown in Figure 2.9(c). 
Sinusoids at other frequencies cannot be immediately estimated from the K- 
point DFT, but higher resolution can be achieved by simply choosing a larger 
K, i.e. by increasing the size of the DFT. 
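
The three cases of Figure 2.9(a)-(c) can be checked numerically. The following 
NumPy sketch is illustrative only: the helper name dft_peak and the choice 
a0 = 1 are assumptions made here, while N = 16, k0 = 3, the 0.4-bin offset, and 
K = 5N are taken from the figure caption. 

```python
import numpy as np

def dft_peak(omega0, N, K, a0=1.0):
    """K-point DFT of a length-N complex sinusoid; returns (peak bin, peak magnitude, spectrum)."""
    n = np.arange(N)
    x = a0 * np.exp(1j * omega0 * n)
    X = np.fft.fft(x, K)              # zero-pads the signal to length K when K > N
    k = int(np.argmax(np.abs(X)))
    return k, np.abs(X[k]), X

N, k0 = 16, 3

# (a) on-bin frequency, critically sampled DFT: a single nonzero bin of height N|a0|
ka, ma, Xa = dft_peak(2 * np.pi * k0 / N, N, N)

# (b) off-bin frequency, critically sampled DFT: energy leaks into every bin
omega_b = 2 * np.pi * (k0 + 0.4) / N
kb, mb, Xb = dft_peak(omega_b, N, N)

# (c) same off-bin frequency, oversampled DFT with K = 5N: the peak lands on bin 5*k0 + 2
K = 5 * N
kc, mc, Xc = dft_peak(omega_b, N, K)
```

In case (c) the peak height is again N|a0|, but the spectrum is no longer 
compact, in line with the zero-padding interpretation that follows. 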

The spectral representation in Figure 2.9(c) is not compact because using an 
oversampled DFT corresponds to padding the end of the signal with K—N zeros 
prior to taking the K-point DFT. The signal is then equivalent to a sinusoid 
of length K time-limited by a window of length N, which means that the 
spectrum corresponds to the K-point DFT of a sinusoid of length K circularly 
convolved with the K-point DFT of a rectangular window of length N. The 
time localization provided by this window induces a corresponding frequency 
delocalization. 

In STFT filter banks, as mentioned earlier, oversampling in frequency is 
simply equivalent to adding more filters to the filter bank and decreasing their 
frequency spacing; this is readily indicated in the following consideration. For 
an analysis window w[n] of length N, the filters in an N-channel filter bank are 
given by 

$$h_k[n] = w[-n]\, e^{j2\pi kn/N}, \qquad k \in \{0, \ldots, N-1\}. \tag{2.65}$$


In terms of the STFT tiling in Figure 2.1, this corresponds to using a critically 
sampled DFT for each vertical slice of the tiling. In a K-channel filter bank 
with K > N, which corresponds to using an oversampled DFT, the filters are 
modulated versions of the same prototype window as in the N-channel case, 
namely 


$$h_k[n] = w[-n]\, e^{j2\pi kn/K}, \qquad k \in \{0, \ldots, K-1\}, \tag{2.66}$$


but the spacing of the channels is now $2\pi/K$, which is less than the $2\pi/N$ 
spacing in the previous case. 

For a single DFT, i.e. one short-time spectrum in the STFT, oversampling 
in frequency corresponds to time-limited interpolation of the spectrum. Other 
methods of spectral interpolation can also be used to identify the location 
of the spectral peak; these are generally based on application of a particular 
window to the original signal. Then, the sinusoid can be identified if the shape 
of the window transform can be detected in the spectrum; the performance 
of such methods has been considered in the literature for the general case of 
multiple sinusoids in noise [22, 99, 207]. This matching approach is particularly 
applicable when a Gaussian window is used since the window transform is then 
simply a parabola in the log-magnitude spectrum; by fitting a parabola to the 
spectral data, the location of a peak can be estimated. Such interpolation 
methods can be coupled with oversampling. An example is given in Figure 
2.9(d), in which a Hanning window is applied to the data prior to zero padding; 
note that this windowing broadens the main lobe of the spectrum but reduces 
the sidelobes. 
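
The parabola-fitting idea can be sketched as follows. This is a minimal 
illustration under stated assumptions, not the procedure of the cited references: 
the function name parabolic_peak, the window width sigma = N/12, and the test 
frequency of 20.3 bins are all choices made here. A complex sinusoid is used so 
that no negative-frequency lobe interferes with the fit. 

```python
import numpy as np

def parabolic_peak(x, window):
    """Estimate the peak frequency (rad/sample) by fitting a parabola to three log-magnitude bins."""
    N = len(x)
    X = np.fft.fft(x * window)
    logmag = np.log(np.abs(X) + 1e-300)       # small offset guards against log(0)
    k = int(np.argmax(logmag))
    a, b, c = logmag[(k - 1) % N], logmag[k], logmag[(k + 1) % N]
    p = 0.5 * (a - c) / (a - 2.0 * b + c)     # vertex offset of the parabola, in bins
    return 2.0 * np.pi * (k + p) / N

N = 128
n = np.arange(N)
sigma = N / 12.0                              # assumed width; the tails are negligible at the frame edges
gauss = np.exp(-0.5 * ((n - (N - 1) / 2.0) / sigma) ** 2)

true_bin = 20.3                               # off-bin test frequency (hypothetical)
x = np.exp(1j * 2.0 * np.pi * true_bin * n / N)
est_bin = parabolic_peak(x, gauss) * N / (2.0 * np.pi)
```

Because the transform of a Gaussian window is itself very nearly Gaussian, the 
three-point fit recovers the off-bin frequency closely without any oversampling; 
for other windows the same fit is only approximate. 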


Two sinusoids. The case of a single sinusoid is of limited interest for mod- 
eling musical signals. With a view to understanding the issues involved in 
modeling complicated signals, the considerations are extended in this section 
to the case of two sinusoids. It will be shown by example that the interference 
of the two components in the frequency domain leads to estimation errors; it 
is shown to be generally erroneous in multi-component signals to assume that 
a spectral peak corresponds exactly to a sinusoid in the signal. The goal of 
alleviating such errors will serve to motivate certain design constraints. 

The signal in question will simply be a sum of unit-amplitude, zero-phase 
sinusoids defined on $n \in [0, N-1]$: 


$$x[n] = e^{j\omega_0 n} + e^{j\omega_1 n}. \tag{2.67}$$


When $\omega_0$ and $\omega_1$ both correspond to bin frequencies of an N-point DFT, both 
sinusoids can be estimated exactly in the DFT spectrum as indicated in Figure 
2.10(a). As shown in Figures 2.10(b) and 2.10(c), the N-point DFT cannot 
identify the sinusoids if either of the frequencies is off-bin. The situation is 
Figure 2.10. Estimation of two sinusoids with the DFT. In (a), the sinusoids are at bin 
frequencies $2\pi k_0/N$ and $2\pi k_1/N$ for N = 16, $k_0 = 3$, and $k_1 = 4$; an N-point 
DFT identifies the sinusoids exactly. As in Figure 2.9, filled circles indicate when perfect 
estimation is achieved; in cases with imperfect estimation, the actual signal components are 
indicated by asterisks. In (b), $\omega_0$ is moved off-bin to $2\pi(k_0 + 0.4)/N$ as shown by the 
asterisk; in (c), $\omega_1$ is moved off-bin to $2\pi(k_0 + 1.2)/N$. In either case, the sinusoids are 
not identified by the DFT. In (d), an oversampled DFT of size K = 5N is used for the 
sinusoids in (b); these are not resolved by oversampling. In (e), oversampling is applied for 
the case in (c); because these sinusoids are separated in frequency, oversampling improves 
the resolution. The plot in (f) depicts a more extreme case of frequency separation in 
which the sinusoids can again be reasonably identified. Note that in (d), (e), and (f), the 
sinusoids cannot be resolved even though their frequencies can be expressed as $2\pi\kappa_0/K$ and 
$2\pi\kappa_1/K$ for integer $\kappa_0$ and $\kappa_1$; this difficulty results from the interference of the sidelobes 
in the combined spectrum, or equivalently because the components are not orthogonal as 
will be explained in Section 2.3.2. 


particularly bleak in Figure 2.10(b), where the two sinusoids are close in fre- 
quency. 

In the case of a single sinusoid, oversampling was used to improve the fre- 
quency resolution. For the case of two closely spaced sinusoids, oversampling 
does not provide a similar remedy. As depicted in Figure 2.10(d), closely spaced 
sinusoids in an oversampled DFT appear as a single lobe; neither component 
can be accurately resolved, and it is inappropriate to identify the spectral peak 
as a single sinusoid in the signal. Figures 2.10(e) and 2.10(f) show that the 
resolution of the oversampled DFT tends to improve as the frequency difference 
increases. Note that in all of the simulations, $\omega_0 = 2\pi\kappa_0/K$ and $\omega_1 = 2\pi\kappa_1/K$ 
for some integers $\kappa_0$ and $\kappa_1$. This choice of frequencies provides a best-case 
scenario for the application of oversampled DFTs, and yet various errors still 
occur; the peaks in the spectrum do not generally correspond to the sinusoids 
in the signal, so estimation of the sinusoidal components by peak picking is 
erroneous. 


Resolution of harmonics. As evidenced in Figure 2.10, separation of the 
spectral lobes improves the ability to estimate the sinusoidal components. This 
property can be used to establish a criterion for choosing the length of the 
signal frame N in STFT analysis. A reasonable limiting condition for approx- 
imate resolution of two components is that two main lobes appear as separate 
structures in the spectrum; this occurs when the component frequencies differ 
by at least half the bandwidth of the main lobe, where the bandwidth is defined 
here as the distance between the first zero crossings on either side of the lobe. 
Mathematically, this condition leads to the constraint 


$$|\omega_0 - \omega_1| \geq \frac{2\pi}{N}, \tag{2.68}$$

which is independent of the oversampling factor; oversampling helps in identi- 
fying off-bin frequencies that are widely separated, but does not improve the 
resolution of closely spaced components. In short, the constraint simply states 
that components must be separated by at least a bin width in an N-point DFT 
to be resolved; this requirement was already suggested in Figure 2.10(b), and 
will play a further role in the next section. Note that the constraint in Equation 
(2.68) involves the standard tradeoff between time and frequency resolution; if 
N is large, accurate frequency resolution is achieved, but this comes with a 
time delocalization penalty resulting from using a large window. 

The constraint in Equation (2.68) cannot be applied without some knowl- 
edge of the expected frequencies in the signal. While this is a questionable 
requirement for arbitrary signals, it is applicable in the common case of pseudo- 
periodic signals. The components in the harmonic spectrum of a pseudo- 
periodic signal are basically multiples of the fundamental frequency, so the 
constraint can be rewritten as 


$$\omega_{\mathrm{fund}}\, N \geq 2\pi. \tag{2.69}$$


Note that this constraint can be interpreted in terms of the number of periods 
of the fundamental frequency, i.e. pitch periods of the signal, that occur in 
the length-N frame; for the components to be resolvable, it is required that at 
least one period be in the frame. When the N-point window spans exactly one 
period, an N-point DFT provides exact resolution of the harmonic components; 
this observation will come into play in the pitch-synchronous sinusoidal model 
discussed in Chapter 5. 
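
The exact-resolution case can be illustrated with a short NumPy sketch. The 
frame length N = 32 and the harmonic amplitudes below are hypothetical values 
chosen for the illustration; the frame is assumed to span exactly one pitch 
period, so each harmonic falls on a bin of the N-point DFT. 

```python
import numpy as np

N = 32                                        # frame length = one pitch period (assumed)
n = np.arange(N)
amps = {1: 1.0, 2: 0.5, 3: 0.25}              # hypothetical amplitudes of harmonics 1..3

# harmonic m of a signal with period N lies at the bin frequency 2*pi*m/N
x = sum(a * np.exp(2j * np.pi * m * n / N) for m, a in amps.items())

X = np.fft.fft(x) / N                         # rectangular-window scaling by N
est = {m: abs(X[m]) for m in amps}            # amplitudes read directly from the harmonic bins
leak = max(abs(X[k]) for k in range(N) if k not in amps)
```

Each estimated amplitude matches its true value and the remaining bins are zero 
to machine precision, which is the orthogonal situation of Equation (2.69) met 
with equality. 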

The formulation of the constraint in Equation (2.69) implicitly assumes the 
use of a rectangular window. For a Hanning window, the main spectral lobe is, 
Figure 2.11. Modeling a two-component signal via peak picking in the DFT. In the two- 
component signal of length N = 64, the frequencies are at $2\pi\kappa_0/K$ and $2\pi\kappa_1/K$ for 
$\kappa_0 = 15$, $\kappa_1 = 17$, and K = 5N. The sinusoids are closely spaced, so a peak picking 
process finds only one sinusoid. The signal is indicated by the solid line in the plot; the 
dotted line indicates the sinusoid estimated by peak picking. 


by construction, twice as wide as that of a rectangular window; as a result, a 
Hanning window must span two signal periods to achieve resolution of harmonic 
components. Since Hanning and other similarly constructed windows have been 
commonly used, it has become a heuristic in STFT analysis to use windows of 
length two to three times the signal period. 


Modeling arbitrary signals. Analysis based on the DFT has been used 
in numerous sinusoidal modeling applications [149, 207, 208]. These meth- 
ods incorporate the constraints discussed above for resolution of harmonics 
and have been successfully applied to modeling signals with harmonic struc- 
ture. Furthermore, the approaches have also shown reasonable performance for 
modeling signals where the sinusoidal components are not resolvable and peak 
picking in the DFT spectrum provides an inaccurate estimate of the sinusoidal 
parameters. This issue is examined here. 

Consider a signal of the form given in Equation (2.67) with component 
frequencies $\omega_0$ and $\omega_1$ closely spaced as in Figures 2.10(b) and 2.10(d). In this case, 
peak picking in the oversampled DFT spectrum identifies a peak between $\omega_0$ 
and $\omega_1$ and interprets this peak as a sinusoid in the signal. At this point, it is 
assumed that the DFT is oversampled such that $\omega_0 = 2\pi\kappa_0/K$ and $\omega_1 = 2\pi\kappa_1/K$ 
for integers $\kappa_0$ and $\kappa_1 = \kappa_0 + 2i$, where i is an integer; this condition simply 
means that there will be an odd number of points in the oversampled DFT 
between $\kappa_0$ and $\kappa_1$. When $\kappa_0$ and $\kappa_1$ are related in this way, the oversampled 
DFT has a peak midway between $\kappa_0$ and $\kappa_1$ at the location 


$$\kappa_p = \frac{\kappa_0 + \kappa_1}{2} \quad\Longrightarrow\quad \omega_p = \frac{\omega_0 + \omega_1}{2}. \tag{2.70}$$


The analysis interprets this peak as a sinusoid in the signal with frequency $\omega_p$ 
and with amplitude and phase given respectively by the magnitude and phase 
of the oversampled DFT at the bin $\kappa_p$. An example of a two-component signal 
and the signal estimate given by peak picking is indicated in Figure 2.11. 
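
The midpoint peak can be reproduced with the parameters quoted in the caption 
of Figure 2.11 (N = 64, K = 5N, bins 15 and 17). The sketch below is an 
illustration, not the author's simulation code; the variable names are 
assumptions, and the amplitude scaling by N presumes a rectangular window. 

```python
import numpy as np

N, K = 64, 5 * 64
kappa0, kappa1 = 15, 17
n = np.arange(N)

# two unit-amplitude, zero-phase sinusoids spaced by 2 bins of the K-point DFT
x = np.exp(2j * np.pi * kappa0 * n / K) + np.exp(2j * np.pi * kappa1 * n / K)

X = np.fft.fft(x, K)
mag = np.abs(X)
kappa_p = int(np.argmax(mag))                 # location of the single merged lobe

# peak-picking estimate: one constant-amplitude sinusoid at the peak
amp_est = mag[kappa_p] / N
phase_est = np.angle(X[kappa_p])
x_hat = amp_est * np.exp(1j * (2 * np.pi * kappa_p * n / K + phase_est))
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

The spectrum has its maximum at the midpoint bin rather than at either true 
component, and the single estimated sinusoid approximates the frame only 
roughly, since its amplitude is constant over the frame. 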
In considering the signal estimate for the case of closely spaced sinusoids, it 
is useful to rewrite the two-component signal as 


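$$\begin{aligned} x[n] &= e^{j\left(\frac{\omega_0 + \omega_1}{2}\right)n} \left[ e^{j\left(\frac{\omega_0 - \omega_1}{2}\right)n} + e^{-j\left(\frac{\omega_0 - \omega_1}{2}\right)n} \right] && (2.71) \\ &= 2\cos[(\omega_0 - \omega_1)n/2]\; e^{j\omega_p n}, && (2.72) \end{aligned}$$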


which indicates that the signal can be written as a sinusoid at $\omega_p$ with an 
amplitude modulation term. With regard to the DFT spectrum, the broad lobe 
resulting from the overlap of the narrow lobes of the two components can be 
interpreted as a narrow lobe at the midpoint frequency that has been widened 
by an amplitude modulation process. It is useful to note the behavior of this 
modulation for limiting cases: the closer the spacing in frequency, the less 
variation in the amplitude, which is sensible since the components become identical 
as $\omega_0 \to \omega_1$; for wider spacing in frequency, the modulation becomes more and 
more drastic, but this is accompanied by an improved ability to resolve the com- 
ponents. The intuition, then, is that when the components cannot be resolved, 
the modulation is smooth within the signal frame. This modulation 
interpretation is not applied in the DFT-based sinusoidal analysis, which estimates the 
signal components in a frame in terms of constant amplitude sinusoids. As 
will be discussed in Section 2.4.2, however, the synthesis routine constructs an 
amplitude envelope for the partials estimated in the frame-to-frame analysis; 
this helps to match the amplitude behavior of the reconstruction to that of the 
signal. In other words, smooth modulation of the amplitude can be tracked by 
the model. 
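
As a numerical check of Equations (2.71) and (2.72), the sketch below compares 
the two-component form with the amplitude-modulated form. The frequencies reuse 
the values from the Figure 2.11 caption; the variable names are assumptions 
introduced for this illustration. 

```python
import numpy as np

N, K = 64, 320
n = np.arange(N)
w0, w1 = 2 * np.pi * 15 / K, 2 * np.pi * 17 / K
wp = 0.5 * (w0 + w1)                          # midpoint frequency, as in (2.70)

two_comp = np.exp(1j * w0 * n) + np.exp(1j * w1 * n)                # form of (2.67)
am_form = 2 * np.cos((w0 - w1) * n / 2) * np.exp(1j * wp * n)       # form of (2.72)

max_diff = np.max(np.abs(two_comp - am_form))
envelope = np.abs(two_comp)                   # the amplitude modulation seen within the frame
```

For this closely spaced pair the envelope falls smoothly and monotonically 
across the frame, which is the kind of slow amplitude variation that a 
frame-to-frame amplitude envelope can follow. 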

The example discussed above involves a somewhat ideal case. For one, the 
formulation is slightly more complicated when the component amplitudes are 
not equal. Furthermore, when the assumptions previously made about the 
component frequencies do not hold, the peak picking process becomes more 
difficult. However, the insights do apply to the case of general signals. For 
arbitrary signals, then, it is reasonable to interpret each lobe in the oversampled 
DFT as a short-time sinusoid. Given this observation, the partial parameters 
for a short-time signal frame can be derived by locating major peaks in the DFT 
magnitude spectrum. For a given peak, the frequency $\omega_q$ of the corresponding 
partial is estimated as the location of the peak and the phase $\phi_q$ is given by the 
phase of the spectrum at the peak frequency $\omega_q$. Note that in the frame-rate 
sinusoidal model, the estimated parameters are designated to correspond to the 
center of the analysis window, so the phase must be advanced from its time 
reference at the start of the window by adding $\omega_q N/2$. The amplitude $A_q$ of 
the partial is given by the height of the peak, scaled down by a factor of N 
for the case of a rectangular window. This scaling factor amounts to the 
time-domain sum of the window values, so scaling by N/2 is called for in the case of 
a Hanning window; note that the peak in Figure 2.9(d) is at half the height of 
the peak in Figure 2.9(c). Further scaling by a factor of 1/2 is required if the 
intent is to estimate real sinusoids from a complex spectrum. Also, there is a 
positive frequency and a negative frequency contribution to the spectrum for 
this case of real sinusoids, which can result in spectral interference that may 
bias the ensuing peak estimation; this is analogous to the estimation errors 
that occur due to sidelobe interference in the two-component case. While this 
method is prone to such errors, it is nevertheless useful for signal modeling; the 
models depicted in later simulations use oversampled DFTs for analysis. 
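
The per-frame estimation just described can be sketched as follows. This is a 
minimal illustration for complex sinusoids, not the implementation used for the 
simulations in this text: the function name pick_partials, its arguments, and 
the test values are hypothetical, and sidelobe peaks are not screened out. 

```python
import numpy as np

def pick_partials(frame, window, K, max_partials):
    """Return (frequency, amplitude, phase at frame center) for the largest spectral peaks."""
    N = len(frame)
    X = np.fft.fft(frame * window, K)         # oversampled DFT of the windowed frame
    mag = np.abs(X)
    # local maxima of the magnitude spectrum, treating the bin index as circular
    is_peak = (mag > np.roll(mag, 1)) & (mag >= np.roll(mag, -1))
    peaks = np.flatnonzero(is_peak)
    peaks = peaks[np.argsort(mag[peaks])[::-1][:max_partials]]
    scale = np.sum(window)                    # N for a rectangular window, about N/2 for Hanning
    partials = []
    for k in peaks:
        omega = 2 * np.pi * k / K
        amp = mag[k] / scale
        phase = np.angle(X[k]) + omega * N / 2        # advance the time reference to the window center
        partials.append((omega, amp, np.angle(np.exp(1j * phase))))   # phase wrapped to (-pi, pi]
    return partials

# usage: one complex sinusoid on the K-point grid, analyzed with a Hanning window
N, K = 64, 320
n = np.arange(N)
omega_true = 2 * np.pi * 37 / K
frame = 0.8 * np.exp(1j * (omega_true * n + 0.3))
est = pick_partials(frame, np.hanning(N), K, max_partials=1)
```

For real signals the amplitude would need the additional factor discussed above, 
and the negative-frequency lobe would bias the estimate slightly. 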


2.3.2 Linear Algebraic Interpretation 


In the previous section, estimation of the parameters of a sinusoidal model 
using the DFT was considered. It was shown that this estimation process is 
erroneous in most cases, but that the errors can be reduced by imposing cer- 
tain constraints. Here, the estimation problem is phrased in a linear algebraic 
framework that sheds light on the errors in the DFT approach and suggests an 
improved analysis. 


Relationship of analysis and synthesis models. The objective in sinu- 
soidal analysis is to identify the amplitudes, frequencies, and phases of a set of 
sinusoids that accurately represent a given segment of the signal. This problem 
can be phrased in terms of finding a compact model using an overcomplete 
dictionary of sinusoids; the background material for this type of consideration 
was discussed in Section 1.3. For an N x K dictionary matrix whose columns 
are the normalized sinusoids 
$$d_k = \frac{1}{\sqrt{N}}\, e^{j\omega_k n} \tag{2.73}$$


such that $\omega_k = 2\pi k/K$, the synthesis model for a segment of length N can be 
expressed in matrix form as 


$$x = D\alpha, \tag{2.74}$$


where x and $\alpha$ are column vectors. Finding a sparse solution to this inverse 
problem corresponds to deriving parameters for the signal model 


$$x[n] = \frac{1}{\sqrt{N}} \sum_{k=1}^{K} \alpha_k\, e^{j\omega_k n} \tag{2.75}$$


where many of the coefficients are zero-valued. 

In the previous section, analysis for the sinusoidal model using the DFT was 
considered. The statement of the problem given here, however, indicates that 
the DFT is by no means intrinsic to the model estimation. In general cases, 
the analysis for an overcomplete model such as the one in Equations (2.74) 
and (2.75) requires computation of a pseudo-inverse of D, which is related to 
projecting the signal onto a dual frame; to achieve compaction, a nonlinear 
analysis such as a best basis method or matching pursuit can be used. Given 
this insight, it is clear that even in the limiting case that D is a basis matrix and 
the frequencies are known but do not lie at the bin frequencies $2\pi k/N$, the DFT is not 
essential to the estimation problem; the model coefficients are given by correlations 
with the dual basis. The only case in which the DFT is entirely appropriate for 
analysis of multi-component signals is the orthogonal case where the synthesis 
components are harmonics at the bin frequencies. However, the DFT is still of 
use; in the previous section, it was shown that the errors in the DFT analysis 
are not always drastic. This issue is examined in the next section. 
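
The role of the dual basis can be made concrete with a small sketch. It assumes 
that the two component frequencies are known in advance and uses the closely 
spaced pair from the Figure 2.11 caption; the complex amplitudes and variable 
names are hypothetical. The pseudo-inverse is computed with NumPy's 
numpy.linalg.pinv. 

```python
import numpy as np

N, K = 64, 320
n = np.arange(N)
omegas = 2 * np.pi * np.array([15, 17]) / K           # known frequencies, off the N-point bin grid
true_amps = np.array([1.0 + 0.0j, 0.6 * np.exp(0.8j)])  # hypothetical complex amplitudes

A = np.exp(1j * np.outer(n, omegas)) / np.sqrt(N)     # N x 2 synthesis matrix, unit-norm columns
x = A @ (np.sqrt(N) * true_amps)                      # the two-component signal segment

dual = np.linalg.pinv(A)                              # rows act as the (conjugated) dual vectors
coeffs = dual @ x / np.sqrt(N)                        # correlations with the dual basis

naive = A.conj().T @ x / np.sqrt(N)                   # correlations with the sinusoids themselves
```

The dual-basis correlations return the true amplitudes, whereas correlating with 
the sinusoids themselves is biased by their mutual correlation, which is the 
source of the peak-picking errors seen earlier. 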


Orthogonality of components. As stated above, the DFT is only appropri- 
ate for analysis when the synthesis components are orthogonal. This explains 
the perfect analyses shown in Figures 2.9(a) and 2.10(a) for the cases of sinu- 
soids at bin frequencies. The one-component example in Figure 2.9 is not of 
particular value here, though; even in the general overcomplete case described 
above, analysis of one-component signals can be carried out perfectly without 
difficulties. The multi-component case, on the other hand, is problematic and 
is thus of interest. 

Figure 2.10 and the accompanying discussion of frequency separation led to 
the conclusion that components can be reasonably resolved by peak picking in 
the DFT spectrum if the components are spaced by at least a bin. Consider 
two unit-norm sinusoids at different frequencies defined as 


$$g_0[n] = \frac{1}{\sqrt{N}}\, e^{j2\pi\kappa_0 n/K} \quad \text{and} \quad g_1[n] = \frac{1}{\sqrt{N}}\, e^{j2\pi\kappa_1 n/K}. \tag{2.76}$$


The magnitude of the correlation of these two functions is given by: 

$$|\langle g_0, g_1 \rangle| = \frac{1}{N} \left| \frac{\sin\left(\pi N [\kappa_0 - \kappa_1]/K\right)}{\sin\left(\pi [\kappa_0 - \kappa_1]/K\right)} \right|. \tag{2.77}$$

This function is at a maximum for $\kappa_0 = \kappa_1$, when the sinusoids are equivalent; 
$|\kappa_0 - \kappa_1| > K/N$, namely separation by more than a bin in an N-point 
spectrum, corresponds to the sidelobe region, where the values are significantly less 
than the maximum. This insight explains why separation of lobes in the spec- 
trum leads to reasonable analysis results in the DFT approach; when the lobes 
are separated, the signal components are not highly correlated, i.e. are nearly 
orthogonal. Likewise, this explains why DFT analysis for the sinusoidal model 
works reasonably well in cases where the window length is chosen according 
to the constraint in Equation (2.69). 
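
Equation (2.77) can be checked against a direct inner product. In the sketch 
below the function names and the values N = 64, K = 320 are assumptions chosen 
for illustration; with these values one bin of the N-point spectrum corresponds 
to K/N = 5 bins of the oversampled grid. 

```python
import numpy as np

def sinusoid_correlation(kappa0, kappa1, N, K):
    """Magnitude of the inner product of two unit-norm sinusoids, computed directly."""
    n = np.arange(N)
    g0 = np.exp(2j * np.pi * kappa0 * n / K) / np.sqrt(N)
    g1 = np.exp(2j * np.pi * kappa1 * n / K) / np.sqrt(N)
    return abs(np.vdot(g0, g1))               # vdot conjugates its first argument

def closed_form(kappa0, kappa1, N, K):
    """Right-hand side of (2.77); the limit at zero separation is 1."""
    d = kappa0 - kappa1
    if d % K == 0:
        return 1.0
    return abs(np.sin(np.pi * N * d / K) / np.sin(np.pi * d / K)) / N

N, K = 64, 320
close_pair = sinusoid_correlation(15, 17, N, K)       # separation of 0.4 bin: highly correlated
one_bin = sinusoid_correlation(15, 20, N, K)          # separation of exactly one bin: orthogonal
```

The correlation is large for the closely spaced pair and vanishes at a 
separation of exactly one bin, which is the first zero of the main lobe. 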


Frames of complex sinusoids. In discussing the sinusoidal model, a lo- 
calized segment of the signal has often been referred to as a frame. Treating 
the sinusoidal analysis in terms of frames of vectors, then, introduces an un- 
fortunate overlap in terminology. For this section, the localized portion of the 
signal will be assumed to be a segment of length N, and the term frame will 
be reserved to designate an overcomplete family of vectors. 

The frame of interest here is the family of vectors 


$$d_k = \frac{1}{\sqrt{N}}\, e^{j2\pi kn/K}, \qquad n \in [0, N-1]. \tag{2.78}$$
If K = N, this family is an orthogonal basis and signal expansions can be 
computed using the DFT. For compact modeling of arbitrary signals, however, 
the overcomplete case (K > N) is more useful. Indeed, the oversampled DFT 
can be interpreted as a signal expansion based on this family of vectors: 


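$$\begin{aligned} X_K[k] &= \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/K} && (2.79) \\ &= \sqrt{N}\, \langle d_k, x \rangle. && (2.80) \end{aligned}$$

The reconstruction can then be expressed as 

$$\begin{aligned} x[n] &= \frac{1}{K} \sum_{k=0}^{K-1} X_K[k]\, e^{j2\pi kn/K} && (2.81) \\ &= \frac{\sqrt{N}}{K} \sum_{k=0}^{K-1} \langle d_k, x \rangle\, e^{j2\pi kn/K} && (2.82) \\ &= \frac{N}{K} \sum_{k=0}^{K-1} \langle d_k, x \rangle\, d_k[n]. && (2.83) \end{aligned}$$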
k=0 


Recalling the earlier discussion of zero padding, the oversampled DFT can be 
interpreted as an expansion of a time-limited signal on [0,N — 1] in terms 
of sinusoidal expansion functions supported on the longer interval [0, K - 1]; 
this interpretation provides a framework for computing a unique expansion 
in terms of an orthogonal basis. Equation (2.83), on the other hand, indicates 
another viewpoint based on the discussion of frames in Section 1.4.2; noting the 
similarity of Equation (2.83) to Equation (1.26), it is clear that the oversampled 
DFT corresponds to a signal expansion in a tight frame with redundancy K/N. 
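
The tight-frame property can be verified numerically. The sketch below builds 
the N x K matrix D whose columns are the vectors of Equation (2.78); the sizes 
N = 16, K = 80 and the random test signal are assumptions made for the 
illustration. 

```python
import numpy as np

N, K = 16, 80
n = np.arange(N)
k = np.arange(K)
D = np.exp(2j * np.pi * np.outer(n, k) / K) / np.sqrt(N)   # N x K, column k is d_k

rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # arbitrary test segment

coeffs = D.conj().T @ x                       # inner products <d_k, x>
x_rec = (N / K) * (D @ coeffs)                # reconstruction as in (2.83)

frame_op = D @ D.conj().T                     # frame operator; (K/N) times the identity if tight
X_K = np.fft.fft(x, K)                        # oversampled DFT, equal to sqrt(N) <d_k, x> by (2.80)
```

The frame operator is a multiple of the identity with factor K/N, which is the 
redundancy of the frame. 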

As discussed in Section 1.4.2, frame expansions of the form given in Equation 
(1.26) are not generally compact. For the oversampled DFT case, such non- 
compactness is depicted in Figures 2.9 and 2.10. These noncompact expansions 
do provide perfect reconstruction of the signal, but this is of little use given 
the amount of data required. Restating the conclusion of the previous section 
in this framework, it is possible in the DFT case to achieve a reasonable signal 
approximation using a highly compacted model based on extracting the largest 
values from the noncompact tight frame expansion. This assertion is verified 
in Figure 2.11 for a simple example; the shortcoming in this example, however, 
is that there is an exact compact model in the overcomplete set that the DFT 
does not identify. With respect to near-perfect modeling of an arbitrary signal, 
the shortcoming is that there are compact models that are more accurate than 
the model derived by DFT peak picking. Arriving at such models, however, is a 
difficult task. It is an open question as to whether the incorporation of such ap- 
proaches in the sinusoidal model improves the rate-distortion performance with 
respect to models based on DFT parameter estimation. Derivation of compact 
models in overcomplete sets is discussed more fully in Chapter 6, but primarily 
for the application of constructing models based on Gabor atoms. A method 
for sinusoidal modeling based on analysis-by-synthesis using an overcomplete 
set of sinusoids is described in Section 2.3.3. 


Synthesis and modification. In an overcomplete signal model, the com- 
ponents are necessarily not all orthogonal. As discussed in Section 1.4.4, this 
results in a difficulty in the synthesis of modified expansions. Namely, some 
additive modifications will correspond to vectors in the null space; for such 
modifications in the model domain, the time-domain reconstruction remains 
unchanged. Furthermore, given that a component can be expressed as a sum 
of other components, some modifications actually correspond to cancellation of 
a desired component, or, in the worst case, cancellation of the entire signal. It 
is thus important to monitor modifications carried out in overcomplete expan- 
sions so as to avoid these pitfalls. A formal consideration of these problems is 
left as an open issue. 
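
The null-space pitfall can be demonstrated with a short sketch. It is only an 
illustration of the first problem named above and not a formal treatment; the 
sizes, the random test signal, and the use of a singular value decomposition to 
obtain a null-space vector are choices made here. 

```python
import numpy as np

N, K = 16, 80
n = np.arange(N)
D = np.exp(2j * np.pi * np.outer(n, np.arange(K)) / K) / np.sqrt(N)   # overcomplete sinusoid dictionary

rng = np.random.default_rng(1)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
alpha = (N / K) * (D.conj().T @ x)            # tight-frame expansion coefficients of x

# the last right singular vector of D lies in its null space, since D has rank N < K
U, s, Vh = np.linalg.svd(D)                   # full_matrices=True, so Vh is K x K
null_vec = Vh[-1].conj()

x_plain = D @ alpha
x_modified = D @ (alpha + 3.0 * null_vec)     # an additive modification in the model domain
```

The modified coefficient set differs from the original by a vector of norm 3, 
yet both synthesize the same time-domain segment. 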


While overcomplete sinusoidal models have been widely used for signal modi- 
fication, the problems discussed above have not been explicitly discussed in the 
literature. It will be seen in later sections that the parametric structure of 
the sinusoidal model allows for resolution of some signal cancellation issues; 
a specific fix discussed in Section 2.5.2 is that phase matching conditions can 
be imposed on additive components at similar frequencies to prevent destruc- 
tive interference. Furthermore, cancellation issues are circumvented to a great 
extent in applications involving sinusoids separated in frequency; as shown 
in Equation (2.77), such sinusoids are nearly orthogonal. Synthesis based on 
nearly orthogonal components of an overcomplete set is well-conditioned with 
respect to modification, so the sinusoidal model performs well in such scenarios. 


2.3.3 Other Methods for Sinusoidal Parameter Estimation 


A number of alternative methods for estimating the parameters of sinusoidal 
models have been considered in the literature. A brief review is given below; 
the focus is placed primarily on methods that introduce substantial model 
adjustments. 


Analysis-by-synthesis. In analysis-by-synthesis methods, the analysis is 
tightly coupled to the synthesis; the analysis is metered and indeed adapted 
according to how well the reconstructed signal matches the original. Often this 
is a sequential or iterative process. Consider an example involving spectral 
peak picking: rather than simultaneously estimating all of the peaks, only the 
largest peak is detected at first. Then the contribution of a sinusoid at this 
peak, i.e. a spectral lobe and perhaps sidelobes as well, is subtracted from the 
spectrum, and the next peak is detected; this approach can be used to account 
for sidelobe interaction. One advantage of this structure over straightforward 
estimation is that it allows the analysis to adapt to reconstruction errors; 
estimation errors are accounted for in subsequent iterations. On the other hand, 
this approach can have difficulties because of its greedy nature. 

The matching pursuit algorithm to be discussed in Chapter 6 is an analysis- 
by-synthesis approach; this notion will be elaborated upon considerably at that 
point. Here, it suffices to note that analysis-by-synthesis has been applied ef- 
fectively in sinusoidal modeling, especially in the case where the sinusoidal 
parameters are estimated directly from the time-domain signal [76, 236]. The 
particular technique of [76] employs a dictionary of short-time sinusoids and is 
indeed an example of a method that bridges the gap between parametric and 
nonparametric approaches. At each stage of the analysis-by-synthesis iteration, 
the dictionary sinusoid that best resembles the signal is chosen for the decom- 
position; its contribution to the signal is then subtracted and the process is 
repeated on the residual. Though it uses a dictionary of expansion functions 
and should thus be categorized as a nonparametric method according to the 
heuristic distinctions of Sections 1.3 and 1.4, the algorithm indeed results in a 
parametric model since the dictionary sinusoids can be readily parameterized. 
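
The iteration just described (pick the best-matching dictionary sinusoid, subtract its contribution, repeat on the residual) can be sketched as follows. This is a minimal illustration and not the algorithm of [76]: the function name, the uniform frequency grid, and the unit-norm complex atoms are all assumptions made for the example.

```python
import numpy as np

def greedy_sinusoid_pursuit(x, num_freqs=256, num_iters=3):
    """Greedy analysis-by-synthesis over a dictionary of complex sinusoids.

    The dictionary layout (num_freqs uniformly spaced frequencies, unit-norm
    atoms) is a hypothetical choice for illustration only.
    """
    n = np.arange(len(x))
    freqs = 2 * np.pi * np.arange(num_freqs) / num_freqs
    # Each row is one unit-norm dictionary atom e^{j w n} / sqrt(N).
    atoms = np.exp(1j * np.outer(freqs, n)) / np.sqrt(len(x))
    residual = x.astype(complex)
    picks = []
    for _ in range(num_iters):
        # Correlate the residual with every atom and keep the best match.
        corr = atoms.conj() @ residual
        best = int(np.argmax(np.abs(corr)))
        picks.append((freqs[best], corr[best]))
        # Subtract the chosen atom's contribution; iterate on the residual.
        residual = residual - corr[best] * atoms[best]
    return picks, residual

# Two stationary complex sinusoids whose frequencies lie on the dictionary grid.
n = np.arange(128)
x = 1.0 * np.exp(1j * 2 * np.pi * 10 / 128 * n) + 0.5 * np.exp(1j * 2 * np.pi * 40 / 128 * n)
picks, residual = greedy_sinusoid_pursuit(x, num_freqs=256, num_iters=2)
```

Each entry of `picks` is a (frequency, complex coefficient) pair, which is why the result can be read as a parametric model even though it is computed from a dictionary.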


Global optimization. The common methods of sinusoidal analysis yield 
frame-rate signal model parameters. Generally the analysis is independent 
from frame to frame, meaning that the parameters derived in one frame do 
not necessarily depend on the parameters of the previous frame; in some cases 
the estimation is guided according to pitch estimates and models of the signal 
evolution, but such guidance is generally localized among nearby frames. If 
the entire signal is considered as a whole in the sinusoidal analysis, a globally 
optimal set of model parameters can be derived. Such optimization is a highly 
complex operation which requires intensive off-line computation [46]. This issue 
is related to the method to be discussed in Section 3.4, in which a slightly re- 
stricted global modeling problem is phrased in terms of dynamic programming 
to reduce the computational cost [174]. 


Statistical estimation. A wide variety of methods for estimating the param- 
eters of sinusoidal and quasi-sinusoidal models have been presented in the spec- 
tral estimation literature. These differ in the structure of the models; some of 
these differences include assumptions about harmonicity and the behavior of the 
partial amplitudes, the effects of underestimating or overestimating the model 
order (the number of sinusoids to be estimated), the presence of noise or other 
contamination, and the metrics applied to determine the parameters, e.g. mini- 
mum mean-square error, maximum likelihood, or a heuristic criterion. Key ref- 
erences for these other methods include [71, 114, 115, 119, 122, 203, 222, 229]. 


2.4 TIME-DOMAIN SYNTHESIS 


Figure 2.12. Time-domain sinusoidal synthesis using a bank of oscillators. The amplitude 
and phase control functions can be derived using an STFT analysis as depicted in Figure 
2.7, or in other ways as described in the text. 

Synthesis for the sinusoidal model is typically carried out in the time domain 
by accumulating the outputs of a bank of sinusoidal oscillators in direct 
accordance with the signal model of Equation (2.1). This notion was previously 
depicted in Figure 2.7; the simple structure of the synthesis bank is given again 
in Figure 2.12 to emphasize a few key points. First, banks of oscillators have 
been widely explored in the computer music field as an additive synthesis tool 
[155, 197, 199]. Early considerations, however, were restricted to synthesis 
of artificial sounds based on simple parameter control functions since corre- 
sponding analyses of natural signals were unavailable and since computational 
capabilities were limited. The development of analysis algorithms has led to the 
application of this approach to modeling and modification of natural signals, 
and advances in computation technology have enabled such synthesis routines 
to be carried out in real time [68, 69]. 

Figure 2.12 also serves to highlight the actual functions $A_q[n]$ and $\Theta_q[n]$. The 
output of the q-th oscillator is $A_q[n] \cos \Theta_q[n]$; it is controlled by sample-rate 
amplitude and total phase functions that must be calculated in the synthesis 
process using the frame-rate (subsampled) analysis data. This calculation in- 
volves two difficulties: line tracking and parameter interpolation, both of which 
arise because of the time evolution of the signal and the resultant analysis pa- 
rameter differences from frame to frame; for instance, the estimated frequencies 
of the partials change in time as the spectral peaks move. Of course, given the 
intent of generalizing the Fourier series to have arbitrary sinusoidal components, 
it is not surprising that some difficulties arise. 


2.4.1 Line Tracking 


The sinusoidal analysis provides a frame-rate representation of the signal in 
terms of amplitude, frequency, and phase parameters for a set of detected 
sinusoids in each frame. This analysis provides the sinusoidal parameters, but 
does not indicate which parameter sets correspond to a given partial. To build a 
signal model in terms of evolving partials that persist in time, it is necessary to 
form connections between the parameter sets in adjacent frames. The problem 
of line tracking is to decide how to connect the parameter sets in adjacent frames 
to establish continuity for the partials of the signal model. Such continuity 
is physically reasonable given the generating mechanism of a signal, e.g. a 
vibrating string. 


Line tracking can be carried out in a simple successive manner by associating 
the q-th parameter set in frame $i$, namely $\{A_{q,i}, \omega_{q,i}, \phi_{q,i}\}$, to the set in 
frame $i+1$ with frequency closest to $\omega_{q,i}$ [149]. The tracking starts by making 
such an association for the pair of parameter sets with the smallest frequency 
difference across all possible pairs; frequency difference is used as the metric 
here, but other cost functions, perhaps including amplitude or a predicted rate 
of frequency change, are of course plausible. Once the first connection is es- 
tablished, the respective parameter sets are taken out of consideration and the 
process is repeated on the remaining data sets. This iteration is continued until 
all of the sets in adjacent frames are either coupled or accounted for as births or 
deaths — partials that are newly entering or leaving the signal. Generally, there 
is some threshold set to specify the maximum frequency difference allowed for 
a partial between frames; rather than coupling a pair of data sets that have a 
large frequency difference, such instances are treated as a separate birth and 
death. This tracking is most effective for relatively stationary signal segments; 
it has difficulty for signal regions where the spectral content is highly dynamic, 
such as note attacks in music. This breakdown is not so much a shortcoming 
of the line tracking algorithm as of the signal model itself; a model consisting 
of smoothly evolving sinusoids is inappropriate for a transient signal. 
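
The successive tracking procedure can be sketched as follows, using frequency difference as the only cost. The function name `track_lines`, the `max_jump` threshold parameter, and the example frequencies are hypothetical; the sketch illustrates the greedy pairing with births and deaths and is not the implementation of [149].

```python
def track_lines(freqs_a, freqs_b, max_jump):
    """Greedily pair frequencies in frame i (freqs_a) with frame i+1 (freqs_b)."""
    # All candidate pairs, sorted by absolute frequency difference.
    pairs = sorted(
        (abs(fa - fb), ia, ib)
        for ia, fa in enumerate(freqs_a)
        for ib, fb in enumerate(freqs_b)
    )
    used_a, used_b, links = set(), set(), []
    for diff, ia, ib in pairs:
        if diff > max_jump:
            break  # every remaining pair is farther apart than the threshold
        if ia not in used_a and ib not in used_b:
            links.append((ia, ib))
            used_a.add(ia)
            used_b.add(ib)
    # Unpaired sets are partials leaving (deaths) or entering (births) the signal.
    deaths = [ia for ia in range(len(freqs_a)) if ia not in used_a]
    births = [ib for ib in range(len(freqs_b)) if ib not in used_b]
    return sorted(links), deaths, births

links, deaths, births = track_lines(
    [440.0, 880.0, 1320.0], [445.0, 1310.0, 2000.0], max_jump=50.0
)
```

In this example the partial at 880 has no continuation within the threshold and is treated as a death, while the set at 2000 in the next frame is treated as a birth.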


For complicated signals with many evolving partials, the problem of line 
tracking is obviously difficult. One important fix, proposed in [207, 208], is 
the use of backward line tracking when necessary; this technique can be used 
to track the partials of a note from the sustain region back to their origins 
in the note attack. Another observation is that line tracking can be aided by 
considering harmonicity; if the partials are roughly harmonic, the data sets can 
be coupled more readily than in the general case [149, 208]. A number of more 
complex methods have been explored in the literature. One noteworthy tech- 
nique involves using the Viterbi algorithm to find the best set of partial tracks 
[14, 246]; the cost of a given set of tracks is generally measured by summing 
the frame-to-frame absolute frequency differences along all of the tracks in the 
set. This approach finds the set of tracks that has the minimum global cost, 
i.e. the smoothest frequency transitions for the entire set, which is markedly 
different from the greedy successive track selection algorithm discussed above. 
This method, which can be cast in the framework of hidden Markov models, 
has proven useful for sinusoidal modeling of complex sounds [43]. Furthermore, 
neural networks have been posed as a possible solution to the line tracking 
problem [1]; nonlinear methods have also proven useful for overcoming some of 
the difficulties in line tracking [239]. 

Line tracking is sometimes considered part of the analysis rather than the 
synthesis. Then, the model includes a partial index or tag for each parameter 
set in each frame. The advantage of including this extra data in the representa- 
tion is that the reconstruction process is simplified such that the synthesis can 
meet real-time computation constraints. The inclusion is thus useful in cases 
where the analysis can be performed off-line; for instance, in audio distribution 
or in real-time signal modification, it is necessary to have a low-complexity syn- 
thesis, meaning that high-complexity operations such as line tracking should 
be lumped with the analysis if possible, even if it does require the inclusion of 
extra data in the parameterization. 


2.4.2 Parameter Interpolation 


After partial continuity is established by line tracking, it is necessary to interpolate 
the frame-rate partial parameters $\{A_q, \omega_q, \phi_q\}$ to determine the sample-rate 
oscillator control functions $A_q[n]$ and $\Theta_q[n]$. Typically, interpolation is done 
using low-order polynomial models such as linear amplitude and cubic total 
phase; the specific approach of [149] is presented here, but other interpolation 
methods have been considered [46, 148, 150, 184, 208]. The partial amplitude 
interpolation in synthesis frame $i$ is a linear progression from the amplitude in 
analysis frame $i$ to that in frame $i+1$ and is given by 

$$A_{q,i}[n] = A_{q,i} + \left(A_{q,i+1} - A_{q,i}\right) \frac{n}{S}, \tag{2.84}$$

where $n = 0, 1, \ldots, S-1$ is the time sample index, and $S$ is the length of the 
synthesis frame; this frame length is equal to the analysis stride $L$ unless the 
analysis parameters are intermediately interpolated or otherwise modified to 
a different time resolution. This amplitude envelope plays a role in modeling 
sinusoids modulated by slowly varying amplitude envelopes; it was shown in 
Section 2.3.1 that such partials correspond to components that are not resolved 
by the DFT analysis. The phase interpolation for time-domain synthesis is 
given by 


$$\Theta_{q,i}[n] = \phi_{q,i} + \omega_{q,i} n + \alpha_{q,i} n^2 + \beta_{q,i} n^3, \tag{2.85}$$

where $\phi$ and $\omega$ enforce phase and frequency matching constraints at the frame 
boundaries, and $\alpha$ and $\beta$ are chosen to make the total phase progression 
maximally smooth [149]. Such phase and frequency matching constraints are 
explored in greater detail in Section 2.5. 
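
A sketch of the two interpolators for one partial over one synthesis frame is given below. The closed form used for the quadratic and cubic coefficients is the one commonly attributed to McAulay and Quatieri; it is assumed here, not confirmed by the text, that this matches the approach of [149]. The function name and the numerical values are illustrative only.

```python
import numpy as np

def interpolate_partial(A0, A1, w0, w1, phi0, phi1, S):
    """Sample-rate amplitude and total phase for one synthesis frame."""
    n = np.arange(S)
    amp = A0 + (A1 - A0) * n / S                        # linear amplitude
    # Integer number of 2*pi wraps that gives the smoothest cubic phase
    # (assumed closed form; see the lead-in paragraph).
    M = np.round(((phi0 + w0 * S - phi1) + (w1 - w0) * S / 2) / (2 * np.pi))
    d = phi1 - phi0 - w0 * S + 2 * np.pi * M            # phase left to absorb
    alpha = 3 * d / S**2 - (w1 - w0) / S
    beta = -2 * d / S**3 + (w1 - w0) / S**2
    theta = phi0 + w0 * n + alpha * n**2 + beta * n**3  # cubic total phase
    return amp, theta, (alpha, beta, M)

S = 64
amp, theta, (alpha, beta, M) = interpolate_partial(1.0, 2.0, 0.30, 0.32, 0.5, 1.7, S)

# Boundary checks at n = S: phase matches phi1 modulo 2*pi, frequency matches w1.
theta_end = 0.5 + 0.30 * S + alpha * S**2 + beta * S**3
freq_end = 0.30 + 2 * alpha * S + 3 * beta * S**2
```

By construction the cubic starts at the frame-i phase and frequency and arrives at the frame-(i+1) phase (up to a multiple of 2π) and frequency, so consecutive frames join without phase or frequency discontinuities.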

Interpolation of the phase parameter is clearly more complex than the am- 
plitude interpolation. For efficient synthesis, then, it is of interest to consider 
more simple models of the phase. Indeed, the experimental observation that 
the auditory system is relatively insensitive to phase motivates the investigation 
of models based on amplitude envelopes and low-complexity phase evolution 
models, thus merging a waveform model with psychoacoustic phenomena in an 
effort to create a perceptually lossless model. For some signals, this so-called 
magnitude-only reconstruction can be done transparently; however, transient 
distortion is increased when the phase is neglected. 

In the frequency-domain synthesis algorithm to be discussed in the next sec- 
tion, the parameter interpolation is not performed directly on the time-domain 
control functions, but is instead implicitly carried out by an overlap-add pro- 
cess which results in a pseudo-linear amplitude envelope and a transcendental 
phase interpolation function. These particular interpolation methods will be 
considered in detail, but the key issue regarding parameter interpolation can 
be made without reference to a specific interpolation scheme. Namely, recon- 
struction artifacts occur when the behavior of the signal does not match the 
interpolation model. This idea is revisited in Section 2.6. 


2.5 FREQUENCY-DOMAIN SYNTHESIS 


An alternative to time-domain synthesis using a bank of oscillators is frequency- 
domain synthesis, in which a representation of the signal is constructed in the 
frequency domain and the time-domain reconstruction is generated by an in- 
verse DFT and overlap-add process. This approach provides various compu- 
tational advantages over general time-domain synthesis [69, 201]. Frequency- 
domain synthesis was described in [149, 150, 221] and more fully presented in 
[201]. In this section, the algorithm of [201] is explored in detail. 


2.5.1 Spectral Synthesis of Partials 


The frequency-domain synthesis algorithm is fundamentally based on the re- 
lationship between the DTFT and the DFT and the resulting implications for 
representing short-time sinusoids. After a brief review of these issues, which are 
intrinsically connected to the matters discussed in Section 2.3.1, the synthesis 
algorithm is described. 


The DTFT, the DFT, and spectral sampling. For an N-point discrete-time 
sequence $x[n]$ defined on the interval $n \in [0, N-1]$, the discrete-time 
Fourier transform is defined as 

$$X(e^{j\omega}) = \sum_{n=0}^{N-1} x[n] e^{-j\omega n}. \tag{2.86}$$

The DTFT is inherently $2\pi$-periodic, so the signal can be reconstructed from 
any DTFT segment of length $2\pi$. For the specific interval $[0, 2\pi]$, the equation 
for signal synthesis is 

$$x[n] = \frac{1}{2\pi} \int_{0}^{2\pi} X(e^{j\omega}) e^{j\omega n} \, d\omega, \tag{2.87}$$

where the interval simply provides the limits for the integral. 
The DTFT is a continuous frequency-domain function that represents a 
discrete-time function; for finite-length signals, there is redundancy in the 
DTFT representation. The redundancy can be reduced by sampling the DTFT, 
which is indeed necessary in digital applications. Sampling the DTFT yields 
a discrete Fourier transform if the samples are taken at uniformly spaced fre- 
quencies: 


$$X[k] = X(e^{j\omega})\big|_{\omega = 2\pi k/K} = \sum_{n=0}^{N-1} x[n] e^{-j 2\pi k n/K}. \tag{2.88}$$


For $K = N$, the sampled DTFT corresponds to a DFT basis expansion of $x[n]$. 
If $K < N$, the spectrum is undersampled and time-domain aliasing results as 
discussed in Section 2.2.1. On the other hand, the case $K > N$ corresponds 
to oversampling of the spectrum; such oversampled DFTs were considered at 
length in Section 2.3.1 for the application of sinusoidal analysis. If $K \geq N$, the 
signal can be reconstructed exactly from the DTFT samples using the synthesis 
formula 

$$x[n] = \frac{1}{K} \sum_{k=0}^{K-1} X[k] e^{j 2\pi k n/K}. \tag{2.89}$$

Representations at different spectral sampling rates have a simple relationship 
if the rates are related by an integer factor; introducing a subscript to denote 
the size of the DFT, 

$$X_K[k] = X(e^{j\omega})\big|_{\omega = 2\pi k/K} \tag{2.90}$$
$$X_M[m] = X(e^{j\omega})\big|_{\omega = 2\pi m/M} \tag{2.91}$$
$$M = \mu K \;\Longrightarrow\; X_M[\mu k] = X(e^{j\omega})\big|_{\omega = 2\pi \mu k/M} = X_K[k]. \tag{2.92}$$


This relationship will come into play in the frequency-domain synthesis algo- 
rithm to be discussed. 
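
The three sampling regimes and the nesting of sampled spectra can be checked numerically. The sketch below uses an arbitrary random test signal; the helper `sampled_dtft` and all variable names are introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N)

def sampled_dtft(x, K):
    """K uniformly spaced samples of the DTFT of x over one period."""
    n = np.arange(len(x))
    k = np.arange(K)
    return np.exp(-2j * np.pi * np.outer(k, n) / K) @ x

# K >= N: the K-point inverse DFT returns x followed by K - N zeros.
K = 16
x_rec = np.fft.ifft(sampled_dtft(x, K))
exact = np.allclose(x_rec[:N], x) and np.allclose(x_rec[N:], 0)

# K < N: the inverse DFT returns a time-aliased version of x.
K_small = 4
x_alias = np.fft.ifft(sampled_dtft(x, K_small)).real
aliased = np.allclose(x_alias, x[:4] + x[4:])

# M = mu * K: every mu-th sample of the M-point spectrum is the K-point spectrum.
mu = 4
nested = np.allclose(sampled_dtft(x, mu * K)[::mu], sampled_dtft(x, K))
```

All three flags evaluate to true for this example, which corresponds to exact reconstruction for an oversampled spectrum, time-domain aliasing for an undersampled one, and the integer-factor relationship between sampling rates.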

The underlying reason that reconstruction can be achieved from the samples 
of the DTFT is that the DTFT is by definition a polynomial function of order 
N - 1 (for a signal of length N). Thus, any N samples specify the DTFT exactly, 
so the signal can in theory be reconstructed from any N or more arbitrarily 
spaced samples. However, due in part to its connection to the fast Fourier 
transform (FFT), the special case of uniform spectral sampling has been of 
greater interest than nonuniform sampling. 


Spectral representation of short-time sinusoids. To carry out frequency- 
domain synthesis, a spectral representation of the partials must be constructed. 
This construction is formulated here for the case of a single partial; the exten- 
sion to multiple partials is developed in the next section. 

A short-time sinusoid with amplitude $A_q$, frequency $\omega_q$, and phase $\phi_q$ can 
be written as 

$$p_q[n] = b[n] A_q e^{j(\omega_q n + \phi_q)}, \tag{2.93}$$

where $b[n]$ is a window function of length $N$. In the frequency domain, this 
signal corresponds to 

$$P_q(e^{j\omega}) = B(e^{j\omega}) * A_q e^{j\phi_q} \delta[\omega - \omega_q] = A_q e^{j\phi_q} B\left(e^{j(\omega - \omega_q)}\right), \tag{2.94}$$

where $*$ denotes convolution and $B(e^{j\omega})$ is the DTFT of the window $b[n]$. 
The spectrum of a short-time sinusoid windowed by $b[n]$ is simply the window 
transform shifted to the frequency of the sinusoid. 

For synthesis based on an IDFT of size K, the appropriate amplitudes and 
phases for a K-bin spectrum must be determined. This discrete frequency 
model can be derived via spectral sampling of the DTFT in Equation (2.94): 


$$P_q[k] = A_q e^{j\phi_q} B\left(e^{j(\omega - \omega_q)}\right)\Big|_{\omega = 2\pi k/K}, \tag{2.95}$$

which corresponds to shifting the window transform $B(e^{j\omega})$ to the continuous 
frequency $\omega_q$ and then sampling it at the discrete bin frequencies $\omega_k = 2\pi k/K$, 
where $K \geq N$ is required to avoid time-domain aliasing. 

Using the above formulation, the short-time sinusoid $p_q[n]$ can be expressed 
in terms of the K-bin IDFT to be used for synthesis: 

$$p_q[n] = \mathrm{IDFT}_K \left\{ A_q e^{j\phi_q} B\left(e^{j(\omega - \omega_q)}\right)\Big|_{\omega = 2\pi k/K} \right\}. \tag{2.96}$$


This representation is depicted in Figure 2.13 for three distinct cases: (1) an 
unmodulated Hanning window b[n], (2) modulation to a bin frequency of the 
DFT, and (3) modulation to an off-bin frequency. Note the location of the 
sample points with respect to the center of the main lobe in each of the cases; 
in the case of off-bin modulation in (3), the samples are asymmetric about 
the center. Also note that in (1) and (2) the only nonzero points in the DFT 
occur in the main lobe since the frequency-domain samples are taken at zero 
crossings of the DTFT sidelobes. All of the windows in the Blackman-Harris 
family exhibit this property by construction; it is not a unique feature of the 
Hanning window [99, 163]. In some applications, this zero-crossing property 
is useful in that a window can be applied efficiently in the DFT domain by 
circular convolution [99]. 
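
The zero-crossing property can be observed numerically. The sketch below assumes a periodic Hanning window whose length equals the DFT size K = 16; the text does not state the window length used in Figure 2.13, so this pairing is an assumption, as are the helper name and the tolerance.

```python
import numpy as np

K = 16
n = np.arange(K)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / K)   # periodic Hanning window, N = K

def nonzero_bins(bin_position, tol=1e-9):
    """Count DFT bins above tol for the window modulated to a bin position."""
    spectrum = np.fft.fft(hann * np.exp(2j * np.pi * bin_position * n / K))
    return int(np.sum(np.abs(spectrum) > tol))

# Cases (1), (2), and (3): unmodulated, on-bin (k = 2), and off-bin (k = 2.4).
counts = [nonzero_bins(0), nonzero_bins(2), nonzero_bins(2.4)]
```

For the unmodulated and on-bin cases only the three main-lobe samples survive, since the remaining samples fall on zero crossings of the window transform; for the off-bin case the samples miss the zero crossings and many more bins are nonzero.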


Spectral motifs. In Equation (2.95) the spectral representation of a short-time 
sinusoid of frequency $\omega_q$ is computed by evaluating $B(e^{j\omega})$ at the frequencies 
$2\pi k/K - \omega_q$. This computation is prohibitively expensive with regard to 
real-time synthesis, however, so it is necessary to precompute and tabulate 
$B(e^{j\omega})$ [69, 201]. Such tabulation requires approximating $B(e^{j\omega})$ in a discrete 
form; this approximation, which will be referred to as a spectral motif [201], is 
considered here. 

A sinusoid at any frequency $\omega_q$ can be represented in the form given in 
Equation (2.95). This unrestricted resolution is achieved since $B(e^{j\omega})$ is a 
continuous function and spectral samples can be taken at arbitrary frequencies 
$2\pi k/K - \omega_q$. In a discrete setting, such resolution can be approximated by 
representing $B(e^{j\omega})$ using a highly oversampled DFT of size $M \gg K$; in this 
framework, the spectral motif is 

$$B[m] = B(e^{j\omega})\big|_{\omega = 2\pi m/M}. \tag{2.97}$$

Figure 2.13. A depiction of frequency-domain sampling for spectra of short-time sinusoids. 
The continuous spectra are the DTFTs of the modulated window functions and the circles 
indicate the spectral samples corresponding to their DFTs. Case (1) is the unmodulated 
Hanning window $b[n]$, case (2) involves modulation to the bin frequency $2\pi k/K$ for $k = 2$ 
and $K = 16$, and case (3) involves modulation to the off-bin frequency corresponding to 
$k = 2.4$. Note that for $k = 0$ and $k = 2$, the DFT of the Hanning window consists of 
only three nonzero points. 

Using such a motif, a sinusoid of frequency $\omega_q = 2\pi m_q/M$ can be represented 
exactly in a K-bin spectrum if $M$ is an integer multiple of $K$, say $M = \mu K$: 

$$P_q[k] = A_q e^{j\phi_q} B\left(e^{j(\omega - \omega_q)}\right)\Big|_{\omega = 2\pi k/K} \tag{2.98}$$
$$= A_q e^{j\phi_q} B\left(e^{j\left(\frac{2\pi k}{K} - \frac{2\pi m_q}{M}\right)}\right) \tag{2.99}$$
$$= A_q e^{j\phi_q} B\left(e^{j\left(\frac{2\pi \mu k}{M} - \frac{2\pi m_q}{M}\right)}\right) \tag{2.100}$$
$$= A_q e^{j\phi_q} B[\mu k - m_q]. \tag{2.101}$$

Figure 2.14. Spectral motifs in the frequency-domain synthesizer. The motif is the 
oversampled main lobe of the DTFT of some window b[n], which is precomputed and stored. 
To represent a partial, the motif is modulated to the partial frequency and then sampled at 
the bin locations of the synthesis IDFT as shown in (b). If the modulation does not align 
with the motif samples, the tabulated motif can be interpolated. 


In this way, a spectral representation of a short-time sinusoid is constructed 
not by directly sampling the DTFT but by sampling the motif, which is itself a 
sampled version of the DTFT. The frequency resolution of the synthesis is thus 
limited not by the size of the synthesis IDFT but by the oversampling of the 
motif. In some other incarnations of frequency-domain synthesis, large IDFTs 
are required to achieve accurate frequency resolution [149, 150, 221]; in this 
algorithm, arbitrary resolution of the synthesis frequencies can be achieved by 
increasing the factor $\mu$, provided that enough memory is available for storage 
of the motif. In audio applications, the resolution limits of the auditory system 
can of course be taken into account in choosing the oversampling [201]. 
Figure 2.14 gives an example of a spectral motif and depicts the resolution 
issues discussed above. Note that if the frequency of a partial cannot be written 
as $2\pi m_q/M$ for some integer $m_q$, the samples in the shifted motif will not align 
with the bins of the synthesis IDFT. To account for this, partial frequencies can 
be rounded; alternatively, linear or higher-order interpolation can be applied to 
the motif if enough computation time is available. These techniques allow for 
various tradeoffs between the frequency resolution, the motif storage require- 
ments, and the computational cost. Beyond the issue of frequency resolution, 
a further approximation in the motif-based implementation is also indicated in 
Figure 2.14. Namely, only the main lobe of $B(e^{j\omega})$ is tabulated; the sidelobes 
are neglected. The result of this approximation is that the spectral represen- 
tation does not correspond exactly to a sinusoid windowed by b[n]; also, each 
modulation of the motif actually corresponds to a slightly different window. In 
practice, these errors are negligible if the window is chosen appropriately [201]. 
In sinusoidal analysis, it is typically assumed that each lobe in the short-time 
spectrum of the signal corresponds to a partial. Various caveats involving this 
assumption were examined in Section 2.3.1; these are not considered further 
here. It should be noted, though, that frequency-domain synthesis involves an 
analogous concept: a partial can be synthesized by inverse transforming an 
appropriately constructed spectral lobe. 
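
The table lookup of Equation (2.101) can be sketched as follows. For simplicity the sketch stores the entire oversampled window transform rather than only its main lobe, so the lookup is exact; truncating the table to the main lobe, as the text describes, would introduce the small sidelobe error discussed above. The Blackman window, the sizes, and the names `motif` and `partial_spectrum` are placeholders chosen for the example.

```python
import numpy as np

N, K, mu = 32, 64, 8           # window length, IDFT size, oversampling factor
M = mu * K                     # motif table size
n = np.arange(N)
b = np.blackman(N)             # stand-in for the synthesis window b[n]
motif = np.fft.fft(b, M)       # B[m]: DTFT of b[n] sampled at 2*pi*m/M

def partial_spectrum(A, phi, m_q):
    """K-bin spectrum of a partial at frequency 2*pi*m_q/M via table lookup."""
    k = np.arange(K)
    return A * np.exp(1j * phi) * motif[(mu * k - m_q) % M]

A, phi, m_q = 0.8, 0.3, 45     # m_q / mu = 5.625, an off-bin frequency
P = partial_spectrum(A, phi, m_q)

# Reference: K-point DFT of the windowed complex sinusoid computed directly.
direct = np.fft.fft(b * A * np.exp(1j * (2 * np.pi * m_q / M * n + phi)), K)
matches = np.allclose(P, direct)
```

The frequency grid available to the synthesizer is set by `mu`, not by `K`: here the partial sits between bins 5 and 6 of the 64-point spectrum and is still represented exactly.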


Accumulation of partials. Since the DTFT and the DFT are linear oper- 
ations, the spectrum of the sum of partials for the signal model can be con- 
structed by accumulating their individual spectra. Denoting the DTFT for the 
i-th synthesis frame as $X(e^{j\omega}, i)$, and using the subscript i to denote the frame 
to which a partial parameter corresponds, the accumulation of partials for the 
i-th frame is given by 


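$$X(e^{j\omega}, i) = \sum_{q=1}^{Q} P_{q,i}(e^{j\omega}) = \sum_{q=1}^{Q} A_{q,i} e^{j\phi_{q,i}} B\left(e^{j(\omega - \omega_{q,i})}\right) \tag{2.102}$$
$$= B(e^{j\omega}) * \sum_{q=1}^{Q} A_{q,i} e^{j\phi_{q,i}} \delta[\omega - \omega_{q,i}], \tag{2.103}$$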

which corresponds in the time domain to 

$$\hat{x}_i[n] = b[n] \sum_{q=1}^{Q} A_{q,i} e^{j(\omega_{q,i} n + \phi_{q,i})}, \tag{2.104}$$

which is simply a windowed sum of sinusoids. If $K \geq N$, the reconstruction 
$\hat{x}_i[n]$ can be generated from the sampled spectrum 


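$$\hat{X}(k, i) = X(e^{j\omega}, i)\big|_{\omega = 2\pi k/K} = \sum_{q=1}^{Q} A_{q,i} e^{j\phi_{q,i}} B\left(e^{j(\omega - \omega_{q,i})}\right)\Big|_{\omega = 2\pi k/K} \tag{2.105}$$

using an IDFT, implemented as an inverse FFT (IFFT) for computational 
efficiency. This formulation shows that a K-bin spectrum for synthesis of a signal 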
segment can be constructed by accumulating sampled versions of a modulated 
window transform. The result in synthesis is then the sum of sinusoids given 
in Equation (2.104). To synthesize a sum of real sinusoids, the K-bin spectrum 
can be added to a conjugate-symmetric version of itself prior to the IDFT; note 
that the window b[n] is assumed real. 
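
A minimal sketch of the accumulation step for real sinusoids follows. The window transform is evaluated directly here instead of through a motif table, the partial parameters are made up, and the factor of one half applied when combining the spectrum with its conjugate-symmetric counterpart is a normalization chosen so that each partial is reconstructed as a unit-scaled cosine.

```python
import numpy as np

N, K = 32, 64
n = np.arange(N)
b = np.hamming(N)                                  # illustrative real window

def window_dtft(w):
    """Evaluate the window transform directly at an array of frequencies w."""
    return np.exp(-1j * np.outer(w, n)) @ b

partials = [(1.0, 0.50, 0.1), (0.6, 1.30, -0.7)]   # (A, omega, phi), made up
wk = 2 * np.pi * np.arange(K) / K                  # bin frequencies 2*pi*k/K

# Accumulate the sampled, shifted window transforms of all partials.
X = sum(A * np.exp(1j * phi) * window_dtft(wk - w) for A, w, phi in partials)

# Conjugate-symmetric counterpart: spectrum of the complex-conjugate signal.
X_conj = np.conj(X[(-np.arange(K)) % K])
x_real = np.fft.ifft(0.5 * (X + X_conj))[:N]

# Reference: windowed sum of real sinusoids computed in the time domain.
target = b * sum(A * np.cos(w * n + phi) for A, w, phi in partials)
matches = np.allclose(x_real, target)
```

Since K is at least N, the first N samples of the inverse transform reproduce the windowed sum of cosines and the remaining K - N samples are zero.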

As discussed in the previous section, the window transform is represented 
using a spectral motif. These motifs are modulated according to the partial 
frequencies from the analysis, and weighted according to the partial amplitudes 
and phases. The approximations made in the motif representation lead to 
some errors in the synthesis, though; namely, the motifs for each partial do not 
exactly correspond to modulated versions of b[n], so the synthesized segment 
is not exactly a windowed sum of sinusoids. This error can be made negligible, 
however, by choosing the window appropriately. Noting that the window b[n] 
is purely a byproduct of the spectral construction, and that it is not necessarily 
the window used in the sinusoidal analysis, it is evident that the design of b[n] 
is not governed by perfect reconstruction conditions or the like. Rather, b[n] 
can be chosen such that its energy is highly concentrated in its main spectral 
lobe; then, neglecting the sidelobes does not introduce substantial errors. Other 
considerations regarding the design of b[n] will be indicated in the next section. 


Overlap-add synthesis and parameter interpolation. Given a series 
of short-time spectra constructed from sinusoidal analysis data as described 
above, a sinusoidal reconstruction can be carried out by inverse transforming 
the spectra to create a series of time-domain segments and then connecting 
these segments with an overlap-add process. This process has distinct ram- 
ifications regarding the interpolation of the partial parameters. Whereas in 
time-domain synthesis the frame-rate data is explicitly interpolated to create 
sample-rate amplitude and phase tracks, in this approach the interpolation is 
carried out implicitly by the overlap-add process. For reasons to be discussed, 
it is important to note that the OLA can be generalized to include a second 
window v[n] in addition to b[n]; this window v[n] is applied to the output of the 
IDFT prior to overlap-add such that the OLA is in effect carried out with the 
product window $t[n] = b[n] v[n]$. Assuming $t[n]$ is of length $N$ and a stride of 
$L = N/2$ is used for the OLA, the synthesis of a single partial for one overlap 
region can be expressed as 

$$t[n] A_0 e^{j(\omega_0 n + \phi_0)} + t[n - L] A_1 e^{j(\omega_1 (n - L) + \phi_1)}, \tag{2.106}$$


where the subscripts 0 and 1 are frame indices, and the subscript q has been 
dropped for the sake of neatness; the offset of $L$ in the second term serves to 
adjust its time reference to the start of the window $t[n - L]$. The contributions 
from the two frames can be coupled into a single magnitude-phase expression; 
the amplitude evolution of the magnitude-phase form is given by 

$$A[n] = \sqrt{A_0^2 t[n]^2 + A_1^2 t[n-L]^2 + 2 A_0 A_1 t[n] t[n-L] \cos\Omega}, \tag{2.107}$$

where 

$$\Omega = (\omega_0 - \omega_1) n + \omega_1 L + \phi_0 - \phi_1, \tag{2.108}$$

and the phase function is 

$$\Theta[n] = \arctan\left[ \frac{A_0 t[n] \sin(\omega_0 n + \phi_0) + A_1 t[n-L] \sin(\omega_1 (n-L) + \phi_1)}{A_0 t[n] \cos(\omega_0 n + \phi_0) + A_1 t[n-L] \cos(\omega_1 (n-L) + \phi_1)} \right]. \tag{2.109}$$

The region where these functions apply is $n \in [L, N]$, namely the second half of 
the window $t[n]$ and the first half of $t[n - L]$. The OLA interpolation functions 
are clearly more complicated than the low-order polynomials used in time- 
domain synthesis. The complications arise in part because the amplitude and 
frequency evolution are not decoupled as in the time-domain case. 

Figure 2.15. Overlap-add with a triangular window provides linear amplitude interpolation 
if the partial frequencies in adjacent frames are equal. Plot (a) shows a triangular-windowed 
partial of amplitude 1 in synthesis frame $i$, plot (b) shows a partial of amplitude 2 in synthesis 
frame $i+1$, and plot (c) shows the linear amplitude interpolation resulting from overlap-add 
of the two frames. 

For an evolving partial, the sinusoidal parameters change from frame to 
frame. The reconstruction in the overlap region is thus generally a sum of two 
sinusoids of different amplitudes and frequencies. In the OLA interpolation, 
this parameter difference results in amplitude distortion due to the beating 
of the different frequencies; furthermore, it results in a transcendental phase 
function. The parameter interpolation functions in OLA are dealt with further 
in Section 2.5.2. Here, the discussion will be limited to choosing the synthesis 
window t[n]. This choice will be motivated by adhering to the case of slow signal 
evolution, where the parameters do not change drastically from one synthesis 
frame to the next; specifically, the treatment will adhere to the limiting case 
in which the frequency parameter is assumed constant across frames: $\omega_0 = \omega_1$. 
This heuristic, coupled with the phase-matching assumptions to be discussed 
later, leads to a simplification in the amplitude interpolation: 


$$A[n] = A_0 t[n] + A_1 t[n - L]. \tag{2.110}$$


If t[n] is chosen to be a triangular window of length N, this overlap-add sum 
provides linear amplitude interpolation as shown in Figure 2.15. This feature 
is desirable since it enables the frequency-domain synthesizer to perform simi- 
larly to the time-domain method while taking advantage of the computational 
improvements that result from using the IFFT for synthesis [69, 201]. 
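
The linear interpolation property is easy to verify numerically. The sketch below assumes a triangular window with its peak of 1 at n = L and a stride of L = N/2; the amplitudes match those described for Figure 2.15 and the variable names are illustrative.

```python
import numpy as np

N = 16
L = N // 2
n = np.arange(N)
t = 1.0 - np.abs(n - L) / L        # triangular window, peak 1 at n = L

A0, A1 = 1.0, 2.0
overlap = np.arange(L, N)          # second half of t[n], first half of t[n - L]

# Overlap-add amplitude for equal frequencies and matched phases.
envelope = A0 * t[overlap] + A1 * t[overlap - L]

# Straight line from A0 at the start of the overlap toward A1 at its end.
linear = A0 + (A1 - A0) * (overlap - L) / L
is_linear = np.allclose(envelope, linear)
```

The falling half of one window and the rising half of the next sum to one at every sample, which is what turns the overlap-add into a linear cross-fade between the two frame amplitudes.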

For the overall OLA window $t[n]$ to be a triangular window, the hybrid 
window $v[n] = t[n]/b[n]$ must be applied to the IDFT output prior to overlap-add. 
Thus, the quotient $v[n] = t[n]/b[n]$ must be well-behaved in order for 
the synthesis to be robust. While $v[n]$ is theoretically a perfect reconstruction 
window for this OLA process, finite precision effects may lead to significant 
errors in the reconstruction if $v[n]$ has discontinuities due to zeros in $b[n]$, for 
instance. Examples of such hybrid windows are given in Figure 2.16 for the case 
of a Hanning window, a Hamming window, and a Blackman-III window [99]; 
this shows that a Hanning window is actually unsuitable for this application 
given the discontinuities at the edges of the hybrid window. 

Figure 2.16. Overlap-add windows in the frequency-domain synthesizer. The plots in the 
right column show $t[n]/b[n]$ when $b[n]$ is a Hanning window, a Hamming window, and a 
Blackman-III window, respectively. 
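
The edge behavior of the hybrid window can be examined with a short numerical sketch. It compares only the Hanning and Hamming cases (the Blackman-III coefficients are not given in this section), uses a window length of 512 chosen arbitrarily, and skips n = 0, where the Hanning quotient is the indeterminate form 0/0.

```python
import numpy as np

N = 512
L = N // 2
n = np.arange(1, N)                       # skip n = 0, where Hanning gives 0/0
t = 1.0 - np.abs(n - L) / L               # triangular OLA window t[n]
hanning = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / N)

v_hanning = t / hanning                   # hybrid window v[n] = t[n] / b[n]
v_hamming = t / hamming
peak_hanning = v_hanning.max()            # grows roughly like 2N / pi^2
peak_hamming = v_hamming.max()            # stays near 1.2 regardless of N
```

Because the Hanning window decays quadratically to zero at its edges while the triangular window decays only linearly, their quotient grows without bound as N increases; the Hamming window never reaches zero, so its hybrid window stays bounded.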


Frequency-domain synthesis and the STFT. It was shown in Section 
2.2.1 that the STFT synthesis can be interpreted as an inverse Fourier transform 
coupled with overlap-add. Likewise, the IDFT/OLA process in the frequency- 
domain synthesizer can be interpreted as an STFT synthesis filter bank. This 
point of view leads to yet another variation of the block diagrams given in 
Figures 2.6 and 2.7. In this interpretation, a parametric model is incorporated 
across all of the bands in the analysis bank as in the sinusoidal model of Figure 
2.7; this parametric model includes the sinusoidal analysis and the construction 
of short-time spectra from the analysis data. Then, the short-time spectra serve 
as input to a synthesis filter bank, which replaces the oscillator bank used in 
time-domain sinusoidal synthesis; the filters in the bank are given by 
$g_k[n] = v[n] e^{j\omega_k n}$, where $v[n]$ is the hybrid window discussed earlier. This interpretation 
of the IDFT/OLA method is depicted in Figure 2.17; the structure is similar 
to that used in the STFT modifications discussed in Section 2.2.2. 


2.5.2 Phase Modeling 


Figure 2.17. Block diagram of frequency-domain synthesis for sinusoidal modeling. The 
parametric model includes the sinusoidal analysis and the construction of short-time spectra 
from the analysis data. The IDFT/OLA process in the frequency-domain synthesizer can be 
interpreted as an STFT synthesis filter bank. 

In the time-domain synthesizer, low-order polynomial models are used to 
interpolate the frame-rate parameters to derive sample-rate amplitude and total 
phase functions; this interpolation is carried out explicitly for each partial 
identified by the line tracking algorithm. In contrast, in the frequency-domain 
synthesizer the parameter interpolation is carried out implicitly by the overlap-add 
process without reference to any line tracking method. Line tracking is only 
required for synthesis if a model of partial continuity is needed for intermediate 
signal modifications or if the signal is to be reconstructed from the amplitude 
data only. The latter case is discussed here. 


Magnitude-only reconstruction and amplitude distortion. Compres- 
sion can be achieved in the sinusoidal model by discarding the phase data. Such 
compaction is justifiable in audio applications given the heuristic notion that 
the ear is insensitive to phase; high-fidelity synthesis can be achieved using only 
the amplitude and frequency information from the analysis. Such magnitude- 
only reconstruction, however, relies on imposing sensible phase models that 
take the frequency evolution into account. In the frequency-domain synthe- 
sizer, for instance, ignoring phase relationships in adjacent frames can lead to 
significant amplitude distortion; consider Equation (2.109) for the simple case 
$A_0 = A_1 = 1$ with zero phase $\phi_0 = \phi_1 = 0$:


\[ A[n] = \sqrt{t[n]^2 + t[n-L]^2 + 2\,t[n]\,t[n-L]\cos(\omega_0 L)}. \tag{2.111} \]


The cosine term in this expression can result in highly distorted amplitude 
envelopes as shown in Figure 2.18. Note that equal amplitudes lead to a worst-case
scenario since the interfering signals can cancel each other exactly at the
midway point in the overlap region. 

Figure 2.18. Plot (a) shows the ideal amplitude envelope for overlap-add with equal
amplitudes in adjacent frames; the underlying triangular windows are also shown. Plot (b)
shows examples of the amplitude distortion that occurs in the overlap region due to phase
mismatch; this example is specifically for the case of frequencies that are equal in adjacent
frames as formulated in Equation (2.107), but the effect is general as discussed in the text.
In the plot, the phase mismatch $\omega_0 L$ ranges from 0 to $\pi$; for a mismatch of $\pi$, the signals
cancel exactly at $n = 3N/4$, halfway through the overlap region.
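
The cancellation at the midpoint of the overlap region can be checked numerically. The following is a minimal sketch, assuming triangular synthesis windows of length N = 2L that overlap by L samples and the equal-amplitude, zero-phase envelope of Equation (2.111); the function names and the choice N = 512 are illustrative and do not come from the text.

```python
import math

def tri(n, N):
    """Triangular window of length N (peak 1 at n = N/2, zero outside [0, N])."""
    if n < 0 or n > N:
        return 0.0
    return 1.0 - abs(n - N / 2) / (N / 2)

def ola_envelope(n, N, mismatch):
    """Amplitude A[n] for equal amplitudes and zero phase in adjacent frames.

    mismatch is the inter-frame phase mismatch omega_0 * L, with L = N / 2.
    """
    L = N // 2
    t0, t1 = tri(n, N), tri(n - L, N)
    value = t0 * t0 + t1 * t1 + 2.0 * t0 * t1 * math.cos(mismatch)
    return math.sqrt(max(0.0, value))  # guard against tiny negative rounding

N = 512
mid = 3 * N // 4                          # halfway through the overlap [N/2, N)
ideal = ola_envelope(mid, N, 0.0)         # no mismatch: the windows sum to one
worst = ola_envelope(mid, N, math.pi)     # mismatch of pi: exact cancellation
print(ideal, worst)
```

With no mismatch the envelope equals one at the midpoint, whereas a mismatch of pi drives it to zero there, which is the worst case described above.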


Phase matching. The example in Figure 2.18 shows that neglecting the 
phase can lead to significant distortion in the OLA synthesis; synthesis with 
zero phase can result in substantial destructive interference. It is thus necessary 
to impose a phase model to avoid amplitude distortion in the reconstruction. 
One approach to limiting the distortion is to match the phases of the interfering 
sinusoids halfway through the overlap region. This constraint is given by 


\[ \phi_1 = \phi_0 + \omega_0\left(\frac{3N}{4}\right) - \omega_1\left(\frac{N}{4}\right), \tag{2.112} \]


where N = 2L is the frame size. If this phase matching is used, the amplitude 
envelope, in the equal-amplitude case, becomes a function of the inter-frame 
frequency difference $\omega_0 - \omega_1$:


\[ A[n] = \sqrt{t[n]^2 + t[n-L]^2 + 2\,t[n]\,t[n-L]\cos\left[(\omega_0 - \omega_1)\left(n - \frac{3N}{4}\right)\right]}. \tag{2.113} \]


Examples of this amplitude distortion are given in Figure 2.19(a) for $|\omega_0 - \omega_1| = A\pi/N$
with the multiplier $A \in [0, 5]$ and $N = 512$; the corresponding overlap-add phase
function $\Theta[n]$ is given for $A \in \{0, 1, 5\}$ in Figures 2.19(b,c,d). Note that the
amplitude distortion increases as the frequency difference increases and that
the phase function is well-behaved, especially for $A = 0$, where it is linear as
expected, and for $A = 1$, where the nonlinearity introduced by the frequency
change is not pronounced.
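
The phase-matching constraint can be verified directly. The sketch below assumes that each frame's sinusoid is referenced to the start of its own frame, so that the earlier frame has phase $\omega_0 n + \phi_0$ and the later frame has phase $\omega_1 (n - L) + \phi_1$, and it assumes triangular windows; the function names and parameter values are illustrative only.

```python
import math

def matched_phase(phi0, w0, w1, N):
    """Phase offset of the later frame that matches phases at n = 3N/4."""
    return phi0 + w0 * (3 * N / 4) - w1 * (N / 4)

def envelope_matched(n, w0, w1, N):
    """Equal-amplitude OLA envelope with phase matching, for n in [N/2, N]."""
    L = N // 2
    t0 = 1.0 - abs(n - N / 2) / (N / 2)        # triangular window, earlier frame
    t1 = 1.0 - abs(n - L - N / 2) / (N / 2)    # triangular window, later frame
    c = math.cos((w0 - w1) * (n - 3 * N / 4))
    return math.sqrt(t0 * t0 + t1 * t1 + 2.0 * t0 * t1 * c)

N = 512
w0 = 5 * math.pi / N            # hypothetical frequency of the earlier frame
w1 = w0 + math.pi / N           # hypothetical frequency of the later frame
phi0 = 0.0
phi1 = matched_phase(phi0, w0, w1, N)

n_mid = 3 * N // 4
phase0 = w0 * n_mid + phi0              # earlier frame, time origin at n = 0
phase1 = w1 * (n_mid - N // 2) + phi1   # later frame, time origin at n = L
print(phase0, phase1, envelope_matched(n_mid, w0, w1, N))
```

The two phases agree at the midpoint, so the envelope is exactly one there; away from the midpoint the residual distortion depends only on the frequency difference.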

To limit the synthesis amplitude distortion characterized in Equation (2.113) 
and Figure 2.19(a), N can be chosen such that frequency differences in typical 
Figure 2.19. Parameter interpolation in overlap-add with phase matching. The amplitude
distortion in overlap-add is reduced if phase matching is used. If the frequencies in adjacent
frames are equal, there is no amplitude distortion and linear interpolation is achieved. In (a),
the amplitude distortion is plotted for inter-frame frequency differences of $A\pi/N$, where
$A \in [0, 5]$ and $N = 512$. The distortion increases as the frequency difference increases.
In plots (b,c,d), the OLA phase function is given for various values of $A$ for $\omega_0 = 5\pi/N$
and $\phi_0 = 0$; the phase is well-behaved.


signals do not lead to significant artifacts. If N is chosen such that


\[ \max_{\substack{\text{all frames,} \\ \text{all partials}}} |\omega_{q,i} - \omega_{q,i+1}| < \frac{\pi}{N}, \tag{2.114} \]


the maximum deviation of the envelope from unity will be less than 2% in 
the worst case scenario of equal-amplitude partials. For the case N = 512, a 
440 Hz partial at a sampling rate of 44.1 kHz can double in frequency in about
10 frames, roughly 60 ms, without significant distortion being introduced; this
rate is suitable for high-quality music synthesis. 
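
The 2% bound and the 440 Hz example can be reproduced with a short calculation. This is a sketch under stated assumptions: triangular windows, the phase-matched envelope of Equation (2.113) evaluated over the overlap region, a frame advance of L = N/2 samples, and a partial whose frequency rises by the maximum allowed amount of $\pi/N$ radians per sample in every frame. The function and variable names are illustrative.

```python
import math

def max_envelope_deviation(delta_w, N):
    """Largest |A[n] - 1| over the overlap region for a frequency difference
    delta_w (rad/sample), assuming triangular windows and phase matching."""
    worst = 0.0
    for n in range(N // 2, N + 1):
        u = (n - N / 2) / (N / 2)          # runs from 0 to 1 across the overlap
        t0, t1 = 1.0 - u, u                # the two triangular windows
        c = math.cos(delta_w * (n - 3 * N / 4))
        a = math.sqrt(t0 * t0 + t1 * t1 + 2.0 * t0 * t1 * c)
        worst = max(worst, abs(a - 1.0))
    return worst

N = 512
fs = 44100.0
deviation = max_envelope_deviation(math.pi / N, N)

# A difference of pi/N rad/sample per frame corresponds to fs/(2N) Hz per frame.
hz_per_frame = (math.pi / N) * fs / (2.0 * math.pi)
frames_to_double = 440.0 / hz_per_frame                   # 440 Hz up to 880 Hz
ms_to_double = frames_to_double * (N / 2) / fs * 1000.0   # hop of N/2 samples
print(deviation, hz_per_frame, frames_to_double, ms_to_double)
```

Under these assumptions the deviation stays just below 0.02, the frequency step is about 43 Hz per frame, and doubling 440 Hz takes slightly more than 10 frames, or a little under 60 ms.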

As stated earlier, the OLA process does not require line tracking if the ampli- 
tude and phase data from the analysis are both incorporated in the synthesis. 
Unlike time-domain synthesis, which requires tracks for interpolation, OLA 
carries out interpolation without reference to the signal continuity. However, 
in cases where compression is achieved by discarding the phase data, it is nec- 
essary to use a line tracking algorithm to relate the partials in adjacent frames 
so that phase matching can be applied; in synthesis based on magnitude-only 
representations, a phase model must be incorporated to mitigate distortion. 


Frequency matching and chirp synthesis. In addition to phase matching, 
the synthesis frequencies in adjacent frames can be matched in the overlap region.
Such frequency matching can be carried out by synthesizing chirps in each
frame instead of constant-frequency sinusoids; the chirp rates are determined 
by a frequency-matching criterion [86, 87]. The caveat here is that the motif 
must be adjusted to represent a chirp instead of a fixed frequency sinusoid; 
this is done by precomputing a motif for various chirp rates and interpolating 
in the precomputed table [87]. Such chirp synthesis, however, has not proved 
necessary for synthesis of natural signals, so the added cost of tabulation and 
interpolation is not readily justified. Of course, this conclusion depends on the 
length of the synthesis windows; if the windows are short enough, the frequency 
variations from frame to frame will be small and will not lead to distortion. In 
a frequency-domain synthesizer with windows on the order of 5 ms long, the 
phase matching described above is sufficient for removing perceptible amplitude 
distortion in the reconstruction of natural signals. 

In Section 2.3.1, the issue of orthogonality of the synthesis components was 
discussed. Orthogonality was argued to be desirable to avoid destructive inter- 
action in the superposition of components in the signal model; this issue was 
considered using a geometric framework. Phase modeling can be interpreted in 
a similar light; considering the windowed partials in adjacent frames as vectors, 
the phase matching process aligns these vectors in the signal space such that 
they add constructively instead of destructively. 


2.6 RECONSTRUCTION ARTIFACTS 


As discussed in Section 1.5.1, the analysis-synthesis procedure for any signal 
model has fundamental resolution limits. In the case of the sinusoidal model, 
the resolution is basically limited by the choice of the frame size and the analysis 
stride. For long frames, the time resolution is inadequate for capturing signal 
dynamics such as attack transients; for short frames, on the other hand, the fre- 
quency resolution is degraded such that identification of sinusoidal components 
in the spectrum becomes difficult. The sinusoidal model is thus governed by 
the same fundamental resolution limits as any time-frequency representation. 

In compact models, limitations in time-frequency resolution tend to result 
in artifacts in the reconstruction. As a result, the analysis-synthesis process 
yields a nonzero residual. The components of the residual include errors made 
by the analysis or the synthesis as well as artifacts resulting from shortcomings 
in the model. In the sinusoidal model, such errors occur if the original signal 
does not behave in the manner specified by the parameter interpolation used 
in the synthesis. In addition to the noiselike components discussed in Section 
2.1.2, then, the residual in the sinusoidal model contains such model artifacts. 

In Section 1.1.2, the perceptual importance of preserving note attacks in 
music synthesis was discussed. With this in mind, the sinusoidal model artifact 
that will be focussed on here is pre-echo distortion of signal onsets. This issue 
was introduced in the example of Figure 2.8; additional examples involving 
simple synthetic signals are given in Figure 2.20. 

Pre-echo in the sinusoidal model is generated by the following mechanism. 
Before the signal onset, there is an analysis frame in which the signal is not 
Figure 2.20. Pre-echo in the sinusoidal model for two synthetic signals: (a) a simple
sinusoid, and (b) a harmonic series. Plots (c) and (d) depict the delocalized reconstructions, 
and plots (e) and (f) show the residuals. Note the pre-echoes and the artifacts near the 
onset times. The model uses an analysis window of length 1024 and a stride of 512. 


present and no sinusoids are found. In the next frame, i.e. the one in which 
the signal onset occurs, various spectral peaks are identified and modeled as 
sinusoids. The line tracking algorithm interprets these partials as births and 
forms a track connecting them to zero-amplitude partials in the previous frame, 
where no spectral peaks were detected. In the reconstruction, then, each of the 
partials in the onset is synthesized with a linear amplitude envelope as specified 
by the parameter interpolation model. The result is that the onset is spread 
into the preceding frame. In general, the birth (or death) of a partial in any 
given frame is delocalized in this manner; in an attack, however, the effect 
is dramatic because all of the partials are treated in this way simultaneously. 
It should be noted that the frame-rate parameters derived by the sinusoidal 
analysis can be interpolated to a different rate to achieve data reduction or to 
match some rate required by the synthesis engine; this process, however, results 
in additional artifacts due to the smoothing carried out by the interpolation. 
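
The mechanism can be illustrated with a deliberately simplified sketch. It does not run a sinusoidal analysis; instead it assumes the analysis outcome described above (no peak in the first frame, a unit-amplitude peak in the following frames) and applies linear amplitude interpolation between frame boundaries. The onset position, frequency, and hop size are hypothetical values chosen for illustration.

```python
import math

def linear_ramp_reconstruction(amps, hop, w):
    """Synthesize one partial from frame-rate amplitudes by linear interpolation.

    amps[j] is the amplitude assigned to the frame boundary n = j * hop.
    """
    out = []
    for j in range(len(amps) - 1):
        for m in range(hop):
            a = amps[j] + (amps[j + 1] - amps[j]) * m / hop
            out.append(a * math.cos(w * (j * hop + m)))
    return out

hop = 512
onset = 900                      # hypothetical onset, inside the second frame
w = 2.0 * math.pi * 0.05         # hypothetical frequency in rad/sample
length = 3 * hop

original = [math.cos(w * n) if n >= onset else 0.0 for n in range(length)]

# Assumed analysis result: a birth from zero amplitude in the first frame.
amps = [0.0, 1.0, 1.0, 1.0]
recon = linear_ramp_reconstruction(amps, hop, w)

pre_echo_energy = sum(x * x for x in recon[:onset])
original_energy_before_onset = sum(x * x for x in original[:onset])
print(original_energy_before_onset, pre_echo_energy)
```

The original has no energy before the onset, while the reconstruction does; that energy is the pre-echo introduced by the amplitude ramp.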

The linear amplitude envelope for a partial onset is clearly visible in the
single sinusoid example of Figure 2.20(a,c,e). This example shows not only the
delocalization of the attack, but also the introduction of a significant artifact
in the residual. Figure 2.20(b,d,f) shows the pre-echo in the sinusoidal model
of a harmonic series with three terms; this illustration is given as a precursor 
to a more complex example involving a natural signal, namely the attack of 
a saxophone note given in Figure 2.21. The delocalization of the attack de- 
grades the synthesis realism and also introduces an artifact in the residual. 
These issues will be discussed in detail in the following two chapters; Chapter 3 
presents multiresolution extensions of the sinusoidal model intended to improve 
the localization of transients, and Chapter 4 discusses modeling of the residual. 
Figure 2.21. Pre-echo in the sinusoidal model for a saxophone note: (a) the original, (b)
the reconstruction, and (c) the residual. 


One approach for preventing reconstruction artifacts is the method described 
in [76], which accounts for the attack problem by separately modeling the over- 
all amplitude envelope of the signal. The amplitude envelope is applied to the 
sinusoidal reconstruction to improve the time localization. This representation, 
however, is nonuniform in that it relies on independent parametric models of 
the envelope and the sinusoidal components. Chapter 3 discusses methods that 
improve the localization without altering the uniformity of the representation. 


2.7 SIGNAL MODIFICATION 


Modifications based on the short-time Fourier transform were discussed in Sec- 
tion 2.2; the difficulty of modifications in such a nonparametric representation 
was one of the motivations for revamping the STFT into the parametric sinu- 
soidal model. Here, modifications based on the sinusoidal model are dealt with 
more explicitly. Specifically, time-scaling, pitch-shifting, and cross-synthesis 
are considered. The treatment here is quite general; formalized details about 
modifications in a specific version of the sinusoidal model can be found in the literature
[76, 77, 185, 188]. Note that the point of this section is not to introduce
novel signal modifications, but rather to emphasize that such modifications can 
be easily realized using the sinusoidal model because of its parametric nature. 


2.7.1 Denoising and Enhancement 


The application of denoising deserves mention here inasmuch as the denoising 
process can be viewed as a signal modification. As discussed, the sinusoidal 
model is ineffective for representing broadband processes. This shortcoming 
motivates the inclusion of the stochastic component proposed in [208] to account 
for musically relevant stochastic features such as breath noise in a flute or bow 
noise in a violin; these must be incorporated if realistic synthesis is desired. This
idea, however, assumes that the original signal is a clean recording of a natural 
instrument. For noisy recordings, the sinusoidal model residual contains both 
the noise and the desired stochastic signal features; unless these two processes 
can be separated, this type of residual is not useful for enhancing the signal 
realism. In these cases, it is generally more desirable to not incorporate the 
residual in the synthesis; in this way, the signal can be denoised via sinusoidal 
modeling. In addition to denoising, the sinusoidal model has been used for 
speech enhancement and dynamic range compression [113, 187]. 


2.7.2 Time-Scaling and Pitch-Shifting


In Section 1.5.1, it is proposed that signal modifications can be carried out by 
modifying the components of a model of the signal. The sinusoidal model is 
particularly amenable to this approach because the modifications of interest are 
easy to carry out on sinusoids. For instance, it is simple to increase or decrease 
the duration of a sinusoid, so if a signal is modeled as a sum of sinusoids, it 
becomes simple to carry out time-scaling on the entire signal. One caveat to 
note is that in some time-scaling scenarios it is important to preserve the rate 
of variation in the amplitude envelope of the signal, i.e. the signal dynamics,
but this can be readily achieved. This issue is related to the time-scaling 
of nonstationary signals, in which some signal regions should be time-scaled 
and some should be left unchanged; for example, for a musical note, which 
can be most simply modeled as an attack followed by a sustain, time-scale 
modifications are most perceptually convincing if the time-scaling is carried 
out only for the sustain region and not for the attack. 
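
A minimal sketch of time-scaling in this setting follows. It assumes a single partial described by frame-rate amplitudes and frequencies, uses linear interpolation of both with a running phase sum (a simplification of the polynomial phase models discussed earlier), and stretches the signal simply by changing the synthesis hop. It does not treat attacks and sustains differently, and all names and values are illustrative.

```python
import math

def synthesize(amps, freqs, hop):
    """Oscillator synthesis of one partial from frame-rate parameters.

    Amplitude and frequency (rad/sample) are linearly interpolated between
    frames; the phase is the running sum of the instantaneous frequency.
    """
    out, phase = [], 0.0
    for j in range(len(amps) - 1):
        for m in range(hop):
            frac = m / hop
            a = amps[j] + frac * (amps[j + 1] - amps[j])
            w = freqs[j] + frac * (freqs[j + 1] - freqs[j])
            out.append(a * math.cos(phase))
            phase += w
    return out

def time_scale(amps, freqs, hop, rho):
    """Time-scale by the factor rho by changing only the synthesis hop."""
    return synthesize(amps, freqs, int(round(rho * hop)))

amps = [1.0] * 5                 # hypothetical frame-rate amplitudes
freqs = [0.1] * 5                # hypothetical frame-rate frequencies
hop = 100
original = synthesize(amps, freqs, hop)
stretched = time_scale(amps, freqs, hop, 2.0)
print(len(original), len(stretched))
```

Because the frequency parameters are left untouched, the stretched signal is twice as long but oscillates at the same frequency, so the pitch is preserved.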

Time-scale modifications can also be carried out using approaches tradition- 
ally referred to as nonparametric [158]. These involve either STFT magnitude 
modification followed by phase estimation as discussed earlier, or analyzing the 
signal for regions, e.g. pitch periods, which can be spliced out of the signal for 
time-scale compression or repeated for time-scale expansion. Computational 
cost and quality comparisons between such approaches and modifications using 
the sinusoidal model have not been formally presented, but this is an area of 
growing interest in the literature and in the electronic music industry [124]. 

The sinusoidal model allows a wider range of modifications than standard 
music synthesizers such as samplers, where the signal is constructed from stored 
sound segments and modifications are limited by the sample-based representa- 
tion. For instance, time-scaling in samplers is carried out by upsampling and 
interpolating the stored segments prior to synthesis, but this process is accom- 
panied by a pitch shift. The sinusoidal model can readily achieve time-scaling 
without pitch-shifting, or the dual modification of pitch-shifting without time- 
scaling. For instance, a simple form of pitch modification can be carried out 
by scaling the frequencies prior to synthesis. For voice applications, however, 
this results in unnatural reconstructed speech. Natural pitch transposition can 
be achieved by interpreting the sinusoidal parameterization as a source-filter 
model and carrying out formant-corrected pitch-shifting as described below. 
Formant-corrected pitch-shifting. The sinusoidal model parameterization
includes a description of the spectral envelope of the signal. This spectral envelope
can be interpreted as a time-varying filter in a source-filter model
in which the source is a sum of unweighted sinusoids. In voice applications, 
the filter corresponds to the vocal tract and the source represents the glottal 
excitation. This analogy allows the incorporation of an important physical 
underpinning, namely that a pitch shift in speech is produced primarily by a 
change in the rate of glottal vibration and not by some change in the vocal 
tract shape or its resonances. To achieve natural pitch-shifting of speech or 
the singing voice using the sinusoidal model, then, the spectral envelope must 
be preserved in the modification stage so as to preserve the formant structure 
of the vocal tract. The pitch-scaling is carried out by scaling the frequency 
parameters of the excitation sinusoids and then deriving new amplitudes for 
these pitch-scaled sinusoids by sampling the spectral envelope at the new fre- 
quencies, using interpolation of the envelope for frequencies that do not fall on 
the spectral bins. This approach allows for realistic pitch transposition. 
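
The procedure can be sketched for a single frame as follows. The sketch assumes that the spectral envelope is obtained by joining the measured partial amplitudes with straight lines and that it is held constant outside the measured frequency range; both choices, along with the function names and the example values, are assumptions made for illustration.

```python
def envelope_at(freq, env_freqs, env_amps):
    """Piecewise-linear spectral envelope through the measured partials.

    env_freqs must be sorted in ascending order. Outside the measured range
    the envelope is held at its end values (an assumption of this sketch).
    """
    if freq <= env_freqs[0]:
        return env_amps[0]
    if freq >= env_freqs[-1]:
        return env_amps[-1]
    for k in range(len(env_freqs) - 1):
        f_lo, f_hi = env_freqs[k], env_freqs[k + 1]
        if f_lo <= freq <= f_hi:
            frac = (freq - f_lo) / (f_hi - f_lo)
            return env_amps[k] + frac * (env_amps[k + 1] - env_amps[k])

def formant_corrected_shift(freqs, amps, ratio):
    """Scale the partial frequencies and re-read amplitudes from the envelope."""
    new_freqs = [f * ratio for f in freqs]
    new_amps = [envelope_at(f, freqs, amps) for f in new_freqs]
    return new_freqs, new_amps

freqs = [100.0, 200.0, 300.0, 400.0]     # hypothetical partial frequencies (Hz)
amps = [1.0, 0.5, 0.8, 0.2]              # hypothetical partial amplitudes
shifted_freqs, shifted_amps = formant_corrected_shift(freqs, amps, 1.5)
print(shifted_freqs, shifted_amps)
```

The shifted partials take their amplitudes from the original envelope rather than carrying their old amplitudes with them, so the peak near 300 Hz stays at 300 Hz after the shift.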


Spectral manipulations. In addition to formant-corrected pitch-shifting, 
the source-filter interpretation of the sinusoidal model is useful for a variety 
of spectral manipulations. In general, any sort of time-varying filtering can be 
carried out by appropriately modifying the spectral envelopes in the parametric 
sinusoidal model domain. For instance, the formants in the spectral envelope 
can be adjusted to yield gender modifications; by moving the formants down in 
frequency, a female voice can be transformed into a male voice, and vice versa 
[241]. Also, the amplitude ratios of odd and even harmonics in a pitched signal 
can be adjusted. These modifications are related to cross-synthesis methods, 
which are considered further in the following section. 


2.7.3 Cross-Synthesis and Timbre Space 


Time-scaling and pitch-shifting modifications are operations carried out on a single
original signal; cross-synthesis, as described in Section 1.2.4, refers to methods 
in which a new signal is created via the interactions of two or more original 
signals. A common example of cross-synthesis is based on source-filter models 
of two signals; useful mixture signals can be derived by using the source from 
one model and the filter from the other, for instance exciting the vocal tract 
filter estimated from a male voice by the glottal source estimated from a female 
voice. Such cross-synthesis has been experimented with in music recording and 
performance; one of the early examples of cross-synthesis in popular music, 
mentioned in Chapter 1, is the cross-synthesized guitar in [67], in which the 
signal from an electric guitar pickup is used as an excitation for a vocal tract 
filter, resulting in a guitar sound with a speech-like formant structure, the 
percept of which is a “talking” guitar. 

Parametric representations enable a wide class of cross-synthesis modifica- 
tions. This notion is especially true in the sinusoidal model since the parameters 
directly indicate musically important signal qualities such as the pitch as well as 
the shape and evolution of the spectral envelope. One immediate example of a
modification is interpolation between the sinusoidal parameters of two sounds; 
this yields a hybrid signal perceived as a coherent merger of the two original 
sounds, and not simply a cross-fade or averaging. This type of modification 
has recently received considerable attention for the application of image mor- 
phing, which is carried out by parameterizing the salient features of an original 
image and a target image (such as edges or prominent regions) and creating 
a map between these parametric features that can be traversed to construct a 
morphed intermediate image [245]. Such morphing has also been used to carry
out audio modifications based on the parametric representation provided by 
the spectrogram, which is the squared magnitude of the STFT [213]. 

In the fields of psychoacoustics and computer music, it has been of interest to 
categorize instrumental sounds according to their location in a perceptual space. 
For instance, the clarinet and the bassoon would be fairly close together in this 
space, while the piano or guitar would not be nearby. Such categorization is 
referred to as multidimensional scaling [155, 199, 242]. It has been observed that
timbre, which corresponds loosely to the evolution and shape of the spectral 
envelope, is an important feature in subjective evaluations of the similarity 
of sounds; if two sounds have the same timbre, they are generally judged to 
be similar [242]. Because the parameters of the sinusoidal model capture the 
behavior of the spectral envelope, i.e. the timbre of the sound, the sinusoidal
representations of various sounds can be used to situate the sounds in a timbre 
space, which can then be explored in musically meaningful ways by interpolating 
between the parameter sets. This interpretation of a parametric timbre space 
as a musical control structure is of interest in computer music [242]. 
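
Interpolation between parameter sets can be sketched very simply for a single frame. The example assumes that the partials of the two sounds have already been paired by index and that linear interpolation of frequency and amplitude is adequate; neither assumption comes from the text, and the parameter values are hypothetical.

```python
def interpolate_partials(set_a, set_b, alpha):
    """Blend two lists of (frequency, amplitude) pairs, partial by partial.

    alpha = 0 returns set_a and alpha = 1 returns set_b. The partials are
    assumed to be paired by index, which is an assumption of this sketch.
    """
    if len(set_a) != len(set_b):
        raise ValueError("parameter sets must have the same number of partials")
    return [((1.0 - alpha) * fa + alpha * fb, (1.0 - alpha) * aa + alpha * ab)
            for (fa, aa), (fb, ab) in zip(set_a, set_b)]

sound_a = [(100.0, 1.0), (200.0, 0.5)]   # hypothetical (Hz, amplitude) pairs
sound_b = [(120.0, 0.2), (260.0, 0.1)]
hybrid = interpolate_partials(sound_a, sound_b, 0.5)
print(hybrid)
```

Sweeping alpha from 0 to 1 traces a path between the two sounds in the parameter space, which is distinct from cross-fading the two waveforms.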


2.8 CONCLUSION 


In this chapter, the nonparametric short-time Fourier transform was discussed 
extensively. It was shown that the STFT can be interpreted as a modulated 
filter bank in which the subband signals can be likened to the partials in a si- 
nusoidal signal model. It was further shown that more compact models can be 
achieved by parameterizing these subband signals to account for signal evolu- 
tion. This idea is fundamental to the sinusoidal model, which can be viewed as 
a parametric extension of the STFT; incorporating such parameterization leads 
to signal adaptivity and compact models. Various analysis issues for the sinu- 
soidal model were considered, and both time-domain and frequency-domain 
synthesis methods were discussed. Since the sinusoidal model is parametric, 
any of these analysis-synthesis methods inherently introduces some reconstruction
artifacts, but these come with the benefits of compaction and modification
capabilities. Minimization of such artifacts by multiresolution methods is dis- 
cussed in Chapter 3, and modeling of the residual is examined in Chapter 4. 


3 MULTIRESOLUTION SINUSOIDAL 
MODELING 


The failure of not having seen is nothing like 
the failure of not having looked.


— Pamela Alexander, “The Catacombs Again” 


As indicated in the previous chapter, the standard sinusoidal model has 
difficulty modeling broadband processes — both noiselike components and time- 
localized transient events such as attacks. Such broadband processes thus ap- 
pear in the residual of the sinusoidal analysis-synthesis. A perceptual model 
for noiselike components will be presented in Chapter 4; that representation, 
however, is inadequate for time-localized events such as attack artifacts, so it is 
necessary to consider ways to prevent these events from appearing in the resid- 
ual. In this chapter, the sinusoidal model is reinterpreted in terms of expansion 
functions; the structure of these expansion functions both indicates why the 
model breaks down for time-localized events and suggests methods to improve 
the model by casting it in a multiresolution framework. Two approaches are 
considered: applying the sinusoidal model to filter bank subbands, and using 
signal-adaptive analysis and synthesis frame sizes. These specific methods are 
discussed after a consideration of multiresolution as exemplified by the discrete 
wavelet transform. 


3.1 ATOMIC INTERPRETATION OF THE SINUSOIDAL MODEL 


The partials in the sinusoidal model can be interpreted as expansion functions 
that comprise an additive decomposition of the signal; this perspective pro- 
vides a conceptual framework for several treatments of sinusoidal modeling in 
the literature [144, 145, 146]. With this notion as a starting point, the sinusoidal
model is here interpreted as a time-frequency atomic decomposition. This in- 
terpretation sheds some light on the fundamental modeling issues and indicates 
a connection between sinusoidal modeling and granular analysis-synthesis. 


3.1.1 The Standard Sinusoidal Model


The atomic interpretation of the sinusoidal model stems from considering the 
frame-to-frame nature of the approach. The model given in Equation (2.1), 
namely 


\[ x[n] \approx \hat{x}[n] = \sum_{q=1}^{Q[n]} p_q[n] = \sum_{q=1}^{Q[n]} A_q[n] \cos\Theta_q[n], \tag{3.1} \]


can be recast into an expression that incorporates the synthesis frames, which 
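are indexed by the subscript $j$:


\begin{align}
x[n] \approx \hat{x}[n] = \sum_{q=1}^{Q[n]} p_q[n] &= \sum_{j} \sum_{q=1}^{Q[n]} p_{q,j}[n] \tag{3.2} \\
&= \sum_{j} \sum_{q=1}^{Q[n]} A_{q,j}[n] \cos\Theta_{q,j}[n], \tag{3.3}
\end{align}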


where $p_{q,j}[n]$ denotes the time-limited portion of the $q$-th partial that corresponds
to the $j$-th synthesis frame. The time-domain sinusoidal synthesis
can thereby be viewed as a concatenation of non-overlapping synthesis frames,
each of which is a sum of localized partials. Each of the components $p_{q,j}[n]$
in Equation (3.2) is time-localized to a synthesis frame and frequency-localized
according to the function $\Theta_{q,j}[n]$. Thus, a sinusoidal model of a signal can be
interpreted as an atomic decomposition given by 


\[ x[n] \approx \sum_{q,j} p_{q,j}[n], \quad \text{where} \quad p_{q,j}[n] = A_{q,j}[n] \cos\Theta_{q,j}[n], \tag{3.4} \]


as indicated in Equation (3.2). 

Consider the typical partial $A_q[n]\cos\Theta_q[n]$ depicted in Figure 3.1, which
has linear amplitude and cubic total phase as in the approach of [149]. The 
corresponding atomic decomposition is depicted in Figure 3.2; for these specific 
interpolation methods, the sinusoidal model derives a signal expansion in terms 
of atoms with linear amplitude and cubic phase. It is important to note that 
the atoms are generated using parameters extracted from the signal and are 
thus signal-adaptive. In this sense, the sinusoidal model can be interpreted as 
a method of granular analysis-synthesis; by its parametric nature, it overcomes 
the limitations of the STFT or phase vocoder with respect to granulation. 
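
The decomposition of a partial into frame-localized atoms can be sketched as follows. For brevity the sketch uses a piecewise-linear amplitude with a linear (constant-frequency) phase rather than the cubic phase of [149]; the hop size, amplitudes, and frequency are hypothetical values, and the function names are illustrative.

```python
import math

def partial(n, amps, hop, w):
    """Partial with piecewise-linear amplitude and, for simplicity, linear phase."""
    j = min(n // hop, len(amps) - 2)
    frac = (n - j * hop) / hop
    a = amps[j] + frac * (amps[j + 1] - amps[j])
    return a * math.cos(w * n)

def atom(n, j, amps, hop, w):
    """Atom for frame j: the partial restricted to that synthesis frame."""
    if j * hop <= n < (j + 1) * hop:
        return partial(n, amps, hop, w)
    return 0.0

hop = 64
amps = [0.0, 1.0, 0.6, 0.3]      # hypothetical frame-rate amplitudes
w = 0.2                          # hypothetical frequency in rad/sample
num_frames = len(amps) - 1
length = num_frames * hop

signal = [partial(n, amps, hop, w) for n in range(length)]
atoms = [[atom(n, j, amps, hop, w) for n in range(length)]
         for j in range(num_frames)]
rebuilt = [sum(a[n] for a in atoms) for n in range(length)]
print(max(abs(s - r) for s, r in zip(signal, rebuilt)))
```

Each atom is nonzero only within its own frame, and summing the atoms recovers the partial, which is the sense in which the frames interlock.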

In this atomic interpretation of the sinusoidal model, the atoms are con- 
nected from frame to frame in accordance with a notion of signal continuity 
or evolution. This connectivity results in partials that persist meaningfully in 
time. The atoms are not disparate events in time-frequency but rather inter- 
locking pieces of a cohesive whole. 

Figure 3.1. A typical partial (a) in the sinusoidal model with linear amplitude (b) and
cubic total phase (c).

Figure 3.2. The partial depicted in Figure 3.1 can be decomposed into these linear amplitude,
cubic phase time-frequency atoms. This decomposition suggests an interpretation
of the sinusoidal model as a method of granular analysis-synthesis in which the grains are
connected in an evolutionary fashion.


3.1.2 Multiresolution Approaches 


The atomic interpretation of the sinusoidal model indicates why the model has 
difficulties representing transient events such as note attacks. Each atom in 
the decomposition spans an entire synthesis frame; the time support or span is 
the same for every atom. The result of this fixed resolution is that events that 
occur on short time scales are not well-modeled; this problem is analogous to 
the difficulty that a Fourier transform has in modeling impulsive signals. In 
addition to the limitations that result from the fixed time support of the atoms, 
however, the sinusoidal model also has time-localization limitations because 
of the frame-to-frame interpolation of the partial parameters as discussed in 
Section 2.6. The sinusoidal model delocalizes transient events in two ways: a 
transient is spread across a synthesis frame because of the fixed time resolution 
of the model expansion functions; also, a transient bleeds into neighboring 
frames due to the interpolation process. 

The time-localization shortcomings of the sinusoidal model can be remedied 
by applying a multiresolution framework to the model. Fundamentally, such 
approaches are motivated by the atomic interpretation of the model: atoms
with constant time support are inadequate for representing rapidly varying 
signals, so it is necessary to admit atoms with a variety of supports into the 
decomposition. To this point it has been implied that shorter atoms are of 
interest, but it should be noted that in some cases it is also useful to lengthen 
the time support of the atoms. In regions where a signal is well-modeled by a 
sum of sinusoids, lengthening the frames improves the frequency resolution of 
the analysis and can thus improve the model; incorporating a diverse set of time 
supports allows for flexible tradeoffs between time and frequency resolution. 
Also, note that long frames are useful for coding efficiency. 

As in other sections of this book, the focus in this chapter will be on time 
resolution and pre-echo distortion. Pre-echo results from both of the localiza- 
tion limitations of the sinusoidal model: within a frame and across frames. The 
first problem is addressed by using shorter frames, i.e. atoms with shorter time
support, directly at an attack, and the latter by incorporating shorter frames 
in the neighborhood of the attack to limit the spreading. 

There are two distinct approaches by which expansion functions with a va- 
riety of time supports can be admitted into the decomposition. In methods 
based on filter banks, subband filtering is followed by sinusoidal modeling of 
the channel signals with long frames for low-frequency bands and short frames 
for high-frequency bands. In time-segmentation methods, the frame size is var- 
ied dynamically based on the signal characteristics; short frames are used 
near transients and long frames are used for regions with stationary behavior. 
These methods are discussed in Sections 3.3 and 3.4, respectively. 

The multiresolution sinusoidal models to be considered incorporate the time- 
frequency localization advantages of wavelet-based approaches while preserving 
the flexibility provided by the parametric nature of the sinusoidal model. Since 
multiresolution and wavelets are intrinsically related, these topics are exam- 
ined in the next section as a precursor to further discussion of multiresolution 
sinusoidal modeling. 


3.2 MULTIRESOLUTION SIGNAL DECOMPOSITIONS 


The basic concept of multiresolution was discussed in Section 1.5.1. Here, the 
idea is developed further; the development is based on the wavelet transform, 
which is inherently connected to the notion of multiresolution [137, 238]. 


3.2.1 Wavelets and Filter Banks 


In this section, wavelets serve as a framework for considering multiresolution as 
well as the relationship between atomic and filter bank models; an understand- 
ing of the wavelet transform will also be useful for future considerations, partic- 
ularly those of Chapter 5. The focus here will be on the discrete wavelet trans- 
form (DWT) and not the related continuous-time wavelet transform (CWT); 
for a treatment of the CWT, the reader is referred to [238]. This treatment is 
not intended as an exhaustive review of wavelet theory but rather as a
discussion of wavelets with a view to understanding multiresolution and
related signal modeling issues. The treatment is restricted primarily to
conceptual matters here; various mathematical details are provided in
Appendix A.


Figure 3.3. Critically sampled perfect reconstruction two-channel filter banks
having this structure can be used to derive the discrete wavelet transform. In
the literature, such a structure is often depicted with the simple line
drawing shown. In many applications of such structures, $h_0[n]$ and $h_1[n]$
are respectively a lowpass and a highpass filter; likewise for $g_0[n]$ and
$g_1[n]$.


Two-channel critically sampled perfect reconstruction filter banks. 
The discrete wavelet transform can be formulated in terms of critically sampled 
two-channel perfect reconstruction filter banks such as the one shown in Figure 
3.3. For such a structure, the condition for perfect reconstruction can be readily 
derived in terms of the z-transforms of the signals and filters; details of the 
derivation are given in Appendix A. The resulting constraints on the filters 
can be summarized as: 


$$ G_i(z)\,H_j(z) + G_i(-z)\,H_j(-z) = 2\,\delta[i - j]. \tag{3.5} $$


In the next section, this condition leads to an interpretation of the filter bank 
in terms of a biorthogonal basis. 
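
As a concrete check of Equation (3.5), consider the two-tap Haar filters. These
filters are an assumed illustrative choice and are not taken from the text; the
synthesis filters are taken here to be time-reversed versions of the analysis
filters. Under these assumptions the four cases of the condition can be
verified directly:

```latex
\begin{align*}
H_0(z) &= \tfrac{1}{\sqrt{2}}\,(1 + z), \qquad
H_1(z) = \tfrac{1}{\sqrt{2}}\,(1 - z), \qquad
G_i(z) = H_i(z^{-1}) \\[4pt]
G_0(z)H_0(z) + G_0(-z)H_0(-z)
  &= \tfrac{1}{2}\,(2 + z + z^{-1}) + \tfrac{1}{2}\,(2 - z - z^{-1}) = 2 \\
G_1(z)H_1(z) + G_1(-z)H_1(-z)
  &= \tfrac{1}{2}\,(2 - z - z^{-1}) + \tfrac{1}{2}\,(2 + z + z^{-1}) = 2 \\
G_0(z)H_1(z) + G_0(-z)H_1(-z)
  &= \tfrac{1}{2}\,(z^{-1} - z) + \tfrac{1}{2}\,(z - z^{-1}) = 0 \\
G_1(z)H_0(z) + G_1(-z)H_0(-z)
  &= \tfrac{1}{2}\,(z - z^{-1}) + \tfrac{1}{2}\,(z^{-1} - z) = 0
\end{align*}
```

In each case the odd powers of $z$ cancel between the two terms, which is the
aliasing cancellation expressed by the condition.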


Perfect reconstruction and biorthogonality. By manipulating the con- 
dition in Equation (3.5), it can be shown that a perfect reconstruction filter 
bank derives a signal expansion in a biorthogonal basis; the basis is related 
to the impulse responses of the filter bank. This relationship is of particular 
interest in that it establishes a connection between the filter bank model and 
the atomic model that underlie the discrete wavelet transform. 

A full mathematical treatment of this topic is given in Appendix A; the 
result is simply that the perfect reconstruction condition in Equation (3.5) can
be expressed in the time domain as

$$ \langle g_i[k],\, h_j[2n - k] \rangle = \delta[n]\,\delta[i - j], \tag{3.6} $$

or equivalently as

$$ \langle h_i[k],\, g_j[2n - k] \rangle = \delta[n]\,\delta[i - j]. \tag{3.7} $$


These expressions show that the impulse responses of the filters, with one of 
the impulse responses time-reversed as indicated, constitute a pair of
biorthogonal bases for discrete-time signals (with finite energy), namely the
space $\ell^2(\mathbb{Z})$;
the time shift of 2n in the time-reversed impulse response arises because of 
the subsampling of the channel signals. The symmetry in the above equations 
indicates that the analysis and synthesis filter banks are mathematically in- 
terchangeable; this symmetry is analogous to the equivalence of left and right 
matrix inverses discussed in Section 1.4.1. 
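
The time-domain biorthogonality condition can be checked numerically. The
following is a minimal sketch that assumes the two-tap Haar filters (an
illustrative choice, not one made in the text) with synthesis filters equal to
the time-reversed analysis filters; the names `inner` and `biorthogonal` are
hypothetical helpers introduced only for this example.

```python
import math

S = 1.0 / math.sqrt(2.0)

# Assumed Haar analysis filters, stored as {time index: value}.
h = [{0: S, -1: S}, {0: S, -1: -S}]
# Assumed synthesis filters: time-reversed analysis filters.
g = [{-n: v for n, v in hi.items()} for hi in h]


def inner(i, j, n):
    """Inner product over k of g_i[k] and h_j[2n - k]."""
    return sum(v * h[j].get(2 * n - k, 0.0) for k, v in g[i].items())


def biorthogonal(shifts=range(-3, 4), tol=1e-12):
    """Check that inner(i, j, n) equals 1 when n == 0 and i == j, else 0."""
    for i in (0, 1):
        for j in (0, 1):
            for n in shifts:
                target = 1.0 if (n == 0 and i == j) else 0.0
                if abs(inner(i, j, n) - target) > tol:
                    return False
    return True
```

For these filters `biorthogonal()` returns `True`; the even shift `2n` in the
inner product reflects the subsampling of the channel signals.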


Interpretation as a signal expansion in a biorthogonal basis. The 
result given above indicates that perfect reconstruction and biorthogonality are 
equivalent conditions; this implies that there is a strong connection between 
filter banks and signal expansions. Specifically, the impulse responses of a 
perfect reconstruction filter bank are related to underlying biorthogonal bases; 
it is shown below that the filter bank computes a signal expansion using these 
bases. Using the notation given in Figure 3.3, the output of the two-channel 
filter bank can be expressed as follows; more details of the derivation are given 
in Appendix A: 


$$
\begin{aligned}
\hat{x}[n] &= \hat{x}_0[n] + \hat{x}_1[n] && (3.8) \\
&= \sum_k y_0[k]\, g_0[n - 2k] + \sum_k y_1[k]\, g_1[n - 2k] && (3.9) \\
&= \sum_k \langle x[m],\, h_0[2k - m] \rangle\, g_0[n - 2k]
 + \sum_k \langle x[m],\, h_1[2k - m] \rangle\, g_1[n - 2k] && (3.10) \\
&= \sum_{i=0}^{1} \sum_k \langle x[m],\, h_i[2k - m] \rangle\, g_i[n - 2k]. && (3.11)
\end{aligned}
$$

Introducing the notation

$$ g_{i,k}[n] = g_i[n - 2k] \quad \text{and} \quad \alpha_{i,k} = \langle x[m],\, h_i[2k - m] \rangle, \tag{3.12} $$

the signal reconstruction can be expressed as an atomic model:

$$ x[n] = \sum_{i \in \{0,1\},\, k} \alpha_{i,k}\, g_{i,k}[n]. \tag{3.13} $$

The coefficients in the atomic decomposition are derived by the analysis filter
bank, and the expansion functions are time-shifts of the impulse responses of the 
synthesis filter bank. As noted earlier, the filter banks are interchangeable; the 
signal could also be written as an atomic decomposition based on the impulse 
responses $h_{i,k}[n]$. In either case, the atoms in the signal model correspond to 
the synthesis filter bank. 
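
The atomic interpretation can be illustrated with a small sketch. It assumes
Haar filters and an even-length input, neither of which is specified in the
text, and the function names `analyze`, `atom`, and `synthesize` are
hypothetical. The analysis step computes the expansion coefficients as inner
products with shifted, time-reversed analysis filters; the synthesis step sums
weighted, shifted synthesis impulse responses.

```python
import math

S = 1.0 / math.sqrt(2.0)


def analyze(x):
    """Return (alpha0, alpha1): inner products of x with the shifted Haar
    analysis filters, one coefficient per pair of input samples."""
    pairs = list(zip(x[0::2], x[1::2]))
    alpha0 = [S * (a + b) for a, b in pairs]  # lowpass channel
    alpha1 = [S * (a - b) for a, b in pairs]  # highpass channel
    return alpha0, alpha1


def atom(i, k, length):
    """Expansion function g_i[n - 2k] on 0 <= n < length (Haar, assumed)."""
    g = [0.0] * length
    g[2 * k] = S
    g[2 * k + 1] = S if i == 0 else -S
    return g


def synthesize(alpha0, alpha1):
    """Rebuild the signal as a weighted sum of atoms."""
    length = 2 * len(alpha0)
    x = [0.0] * length
    for i, alpha in enumerate((alpha0, alpha1)):
        for k, a in enumerate(alpha):
            for n, g in enumerate(atom(i, k, length)):
                x[n] += a * g
    return x
```

Applying `synthesize(*analyze(x))` to an even-length list returns the input to
within rounding error, and the total number of coefficients equals the number
of input samples, as expected for a critically sampled filter bank.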


It has thus been shown that filter banks compute signal expansions. In- 
deed, any critically sampled perfect reconstruction filter bank implements a 
signal expansion in a biorthogonal basis, and any filter bank that implements 
a biorthogonal expansion provides perfect reconstruction; biorthogonality and 
perfect reconstruction are equivalent conditions [238]. At this point, however, 
the notion of multiresolution has not yet entered the considerations; the atoms 
in the decomposition of Equation (3.13) do not have multiresolution proper- 
ties. In the next section, it is shown that multiresolution can be introduced by 
iterating two-channel filter banks. Such iteration is the foundation of wavelet 
packets and the discrete wavelet transform. 


Tree-structured filter banks and wavelet packets. A wide class of sig- 
nal transforms, known as wavelet packets, are based on the observation that 
a perfect reconstruction filter bank with a tree structure can be derived by 
iterating two-channel filter banks in the subbands. Examples of such
tree-structured filter banks are depicted in Figure 3.4. For this treatment, it
is important to note that the filters $H_0(z)$ and $H_1(z)$ are generally a
lowpass and a highpass, respectively, and likewise for $G_0(z)$ and $G_1(z)$;
this lowpass-highpass
filtering in the constituent two-channel filter banks leads to spectral decompo- 
sitions such as those depicted in Figure 3.4 for the given tree-structured filter 
banks. Frequency-domain interpretations of aliasing cancellation and signal 
reconstruction based on this lowpass-highpass structure are given in [232, 238]. 


Arbitrary tree-structured filter banks that achieve perfect reconstruction can 
be constructed by iterating two-channel perfect reconstruction filter banks; in- 
deed, the filter trees can be made to adapt to model nonstationary input sig- 
nals while still satisfying the reconstruction constraint [102]. In this treatment, 
the primary issue of interest is the manner in which iteration of two-channel 
subsampled filter banks leads to multiresolution. The basic principle is that a 
two-channel filter bank splits its input spectrum into two bands and the ensuing 
downsampling spreads each band such that the subband signals are again full 
band (considered at the subsampled rate); this successive halving leads to the 
spectral decompositions given in Figure 3.4 for the specific filter banks shown. 
The spectral decompositions indicate multiresolution in frequency, which is in- 
herently coupled to multiresolution in time by the principle that to increase 
frequency resolution, it is necessary to decrease time resolution. The connec- 
tion is immediate: the narrowest spectral bands correspond to the deepest 
levels of iteration; each iteration involves a convolution, which spreads out the 
time resolution of the overall branch, so the subbands that are most localized 
in frequency are least localized in time.


Figure 3.4. Tree-structured filter banks that satisfy the perfect reconstruction condition 
can be constructed by iterating two-channel perfect reconstruction filter banks. Such itera- 
tion is fundamental to the discrete wavelet transform as well as arbitrary wavelet packet filter 
banks. These iterated filter banks provide multiresolution analysis-synthesis as suggested by 
the indicated spectral decompositions. Note that the discrete wavelet transform derives an 
octave-band decomposition of the signal. 


The brief description of multiresolution in tree-structured filter banks sug- 
gests why such methods might prove useful for processing arbitrary signals, 
especially if the filter bank is made adaptive; application examples include 
compression [102, 191] and spectral estimation [233]. Rather than focusing on 
such arbitrary tree-structured filter banks here, however, additional develop- 
ments of the multiresolution concept will be formulated for the specific case of 
the discrete wavelet transform. As noted in Figure 3.4, the discrete wavelet 
transform corresponds to successive iterations on the lowpass branch.


The discrete wavelet transform. The discrete wavelet transform is perhaps
the most common example of a tree-structured filter bank. It has been
widely explored in the literature [232, 238]. Here, the discussion is limited to 
general signal modeling issues. 


The discrete wavelet transform is constructed by successive iterations on 
the lowpass branch. Given that $H_0(z)$ and $H_1(z)$ are respectively a lowpass 
and a highpass filter, the filtering operations can be readily interpreted. The 
first stage splits the signal into a highpass and lowpass band, each of which 
is spread to full band by the subsequent downsampling. Given this spreading 
that accompanies downsampling, the effect of the second stage is to simply 
split the lowpass portion of the original signal into halves. Each stage of the 
discrete wavelet transform splits the lowpass spectrum from the previous stage; 
this results in an octave-band decomposition of the signal, which is depicted in 
an ideal sense in Figure 3.4. 
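
The iteration on the lowpass branch can be sketched as follows. The example
assumes Haar filters and an input length divisible by two raised to the tree
depth; these choices and the function names are illustrative and are not taken
from the text.

```python
import math

S = 1.0 / math.sqrt(2.0)


def split(x):
    """One two-channel stage: filter and downsample by two (Haar, assumed)."""
    pairs = list(zip(x[0::2], x[1::2]))
    low = [S * (a + b) for a, b in pairs]
    high = [S * (a - b) for a, b in pairs]
    return low, high


def merge(low, high):
    """Inverse of split: upsample, filter, and add the two channels."""
    x = []
    for l, h in zip(low, high):
        x.extend([S * (l + h), S * (l - h)])
    return x


def dwt(x, depth):
    """Iterate split() on the lowpass branch only.

    Returns [high_1, high_2, ..., high_depth, low_depth], i.e. the bands
    ordered from the widest (highest) band to the narrowest (lowest)."""
    bands = []
    low = list(x)
    for _ in range(depth):
        low, high = split(low)
        bands.append(high)
    bands.append(low)
    return bands


def idwt(bands):
    """Rebuild the signal from the bands produced by dwt()."""
    low = bands[-1]
    for high in reversed(bands[:-1]):
        low = merge(low, high)
    return low
```

For a 16-sample input and a depth of three, the band lengths are 8, 4, 2, and
2: each stage halves the number of samples in the remaining lowpass band, which
mirrors the octave-band splitting of the spectrum.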


As noted in the previous section, the deepest levels of iteration correspond 
to narrow frequency bands that necessarily lack time resolution. This tradeoff 
is very natural for octave-band decompositions. Low frequency signal com- 
ponents change slowly in time, so time resolution is not important. On the 
other hand, high frequency components are characterized by rapid time vari- 
ations; to track such variations from period to period, for instance, time lo- 
calization is important. This is exactly the time-frequency tradeoff provided 
by the discrete wavelet transform. Since the auditory system exhibits such 
frequency-dependent resolution, the wavelet approach has been considered for 
the application of auditory modeling [16, 153, 231]. 


The time-frequency localization in a given subband depends on its depth in 
the filter bank tree. A mathematical treatment of this is most easily carried out 
for a specific example. Consider a wavelet filter bank tree of depth three. By 
interchanging filters and downsamplers in the analysis bank and interchanging 
filters and upsamplers in the synthesis bank, a depth-three discrete wavelet 
transform filter bank based on the filters $G_0(z)$, $G_1(z)$, $H_0(z)$, and $H_1(z)$ can 
be recast into the form shown in Figure 3.5; here, the deepest branches of the 
wavelet tree are now the filters with the most multiplicative components and 
the highest downsampling factors. The frequency-domain multiplication serves 
to narrow the frequency response and improve the frequency localization; the 
corresponding time-domain convolution serves to broaden the impulse response 
and decrease the time resolution. This spreading is shown in Figure 3.6 for 
a type of Daubechies wavelet that will be used for all of the wavelet-based 
simulations in this book [41]; the functions shown are the impulse responses of 
the synthesis filters in Figure 3.5. Note that the subband signals in the wavelet 
filter bank are at different sampling rates; appropriately, the narrowest bands 
have the lowest sampling rate. Furthermore, it is important to keep in mind 
that the synthesis filter bank carries out aliasing cancellation. 
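
The broadening of the impulse responses with tree depth can be computed
explicitly. In the sketch below, moving a filter across a downsampler by a
factor of M replaces its transfer function H(z) by H(z^M), so each branch
becomes a product of upsampled filters followed by a single downsampler. The
four-tap Daubechies lowpass filter and the alternating-flip highpass filter
are assumptions made for illustration; the text does not specify which
Daubechies wavelet is used, and the function names are hypothetical.

```python
import math


def upsample(h, factor):
    """Insert factor - 1 zeros between taps, i.e. H(z) -> H(z**factor)."""
    out = [0.0] * ((len(h) - 1) * factor + 1)
    out[::factor] = h
    return out


def convolve(a, b):
    """Linear convolution of two coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            out[i + j] += u * v
    return out


def equivalent_filters(h0, h1, depth):
    """Equivalent branch filters of a DWT tree of the given depth.

    Returns (filter, downsampling factor) pairs ordered from the widest
    band to the narrowest; the final entry is the lowpass branch."""
    branches = []
    low = [1.0]
    for d in range(depth):
        factor = 2 ** d
        branches.append((convolve(low, upsample(h1, factor)), 2 * factor))
        low = convolve(low, upsample(h0, factor))
    branches.append((low, 2 ** depth))
    return branches


# Assumed four-tap Daubechies lowpass filter and its alternating flip.
r3 = math.sqrt(3.0)
norm = 4.0 * math.sqrt(2.0)
h0 = [(1 + r3) / norm, (3 + r3) / norm, (3 - r3) / norm, (1 - r3) / norm]
h1 = [((-1) ** n) * h0[3 - n] for n in range(4)]
```

With these filters, `equivalent_filters(h0, h1, 3)` yields impulse responses of
lengths 4, 10, 22, and 22 with downsampling factors 2, 4, 8, and 8: the
branches with the narrowest bands have the longest impulse responses and the
lowest sampling rates.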


Atoms and filters. Earlier, the atomic model of the subband signals in a
two-channel filter bank was derived. A similar model can be arrived at for the
discrete wavelet transform [238]. The transform can thus be interpreted as a
filter bank or as an atomic decomposition; there is a similar duality here as
in the interpretations of the STFT discussed in Section 2.2.1, and the
interpretations are connected by way of the tiling diagram. The two
interpretations are further linked by a notion of evolution in that a subband
signal is derived as an accumulation of atoms corresponding to the impulse
responses of the synthesis filter in that band. The evolution, however, is not
signal-adaptive as in the sinusoidal model.


Figure 3.5. A tree-structured wavelet filter bank with three stages of
iteration can be manipulated into this equivalent form.


Figure 3.6. Impulse responses of a wavelet synthesis filter bank for a type of
Daubechies wavelet. The expansion functions in the corresponding wavelet
decomposition are these impulse responses and their shifts by 2, 4, 8, and 8 as
indicated by the downsampling and upsampling factors in the filter bank of
Figure 3.5.


Figure 3.7. A pyramid structure for multiresolution filtering. This diagram depicts the 
analysis filter bank of the pyramid approach, which actually incorporates the synthesis process 
to ensure perfect reconstruction; synthesis is carried out by a structure similar to the right 
side of the analysis pyramid. 


3.2.2 Pyramids 


Multiresolution decompositions can be derived using pyramid structures such 
as the one in Figure 3.7. These were originally introduced for multiresolution 
image processing [29]; the relationship to wavelets was realized shortly there- 
after. The decomposition is based on the idea of successive refinement; the 
signal is modeled as a sum of a coarse version (the top of the pyramid) plus 
detail signals. 

There are several interesting things to note about the pyramid approach. 
Most importantly, perfect reconstruction is immediate; there are no elaborate 
constraints. This ease of perfect reconstruction is related to the fact that the 
pyramid decomposition is not critically sampled. Note that the coarse signal 
estimate derived at the highest level of the pyramid is analogous to the output 
of the lowest branch of a wavelet filter bank tree, but that the detail signals in 
the pyramid scheme are at higher rates than the corresponding detail signals 
in a wavelet filter bank; the output signal at the lowest level of the pyramid is 
itself full-rate. For the pyramid in Figure 3.7, the representation is oversampled 
by a factor of $1 + \frac{1}{2} + \frac{1}{4} = \frac{7}{4}$; for continued iterations, the oversampling factor 
asymptotically approaches two. Along with simplifying perfect reconstruction, 
this oversampling results in added robustness to quantization noise [238]. Note 
also that the synthesis filters are included in the analysis; the result is an 
analysis-by-synthesis process that can be made to resolve some of the difficulties 
in wavelet filter banks. For instance, a pyramid-structured filter bank can be 
defined such that the subband signals are free of aliasing [66, 206]. 
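
The ease of perfect reconstruction in a pyramid can be seen in a short sketch.
Each detail signal is the difference between a signal and a prediction formed
from its coarse version, so the synthesis simply adds the prediction back; no
constraint on the filters is needed. The pair-averaging and sample-repetition
operators below are deliberately crude stand-ins for the analysis and synthesis
filters of Figure 3.7 and are assumptions of this example, as are the function
names.

```python
def coarsen(x):
    """Lowpass filter and downsample by two (here: average adjacent pairs)."""
    return [0.5 * (a + b) for a, b in zip(x[0::2], x[1::2])]


def predict(c):
    """Upsample by two and interpolate (here: repeat each sample)."""
    return [v for v in c for _ in range(2)]


def pyramid_analysis(x, levels):
    """Return (details, coarse); details[0] is at the full input rate."""
    details = []
    coarse = list(x)
    for _ in range(levels):
        nxt = coarsen(coarse)
        approx = predict(nxt)
        details.append([a - b for a, b in zip(coarse, approx)])
        coarse = nxt
    return details, coarse


def pyramid_synthesis(details, coarse):
    """Add each detail signal back to the prediction from the level above."""
    x = list(coarse)
    for d in reversed(details):
        x = [a + b for a, b in zip(predict(x), d)]
    return x


def oversampling_factor(levels):
    """Samples in the representation per input sample: 1 + 1/2 + ... ."""
    return sum(0.5 ** j for j in range(levels + 1))
```

For an 8-sample input and two levels, the representation holds 8 + 4 + 2 = 14
samples, which corresponds to `oversampling_factor(2) == 1.75`; the factor
approaches two as the number of levels grows.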

In the pyramid of Figure 3.7, the signal decomposition is based on successive 
applications of the same filter pair $\{H_0(z), G_0(z)\}$. This is just one example 
of a pyramid approach, however. The pyramid structure can be generalized 
by applying arbitrary signal models on the levels of the pyramid rather than 
filtering and downsampling; for instance, in image coding it is common to apply 
nonlinear interpolation and decimation operators in such pyramid filters [238].


Figure 3.8. General structure of subband sinusoidal modeling. Alternatively,
the sinusoidal model can be designed to yield signals that are intended as
inputs to a synthesis filter bank, but this method has difficulties with
aliasing cancellation.


3.3 FILTER BANK METHODS


Filter bank methods for multiresolution sinusoidal modeling involve modeling 
the subband signals; a basic block diagram for this subband approach is given 
in Figure 3.8. The signal is split into bands of varying width, and each subband 
signal is then modeled with a separate sinusoidal model with resolution
commensurate with the bandwidth: for narrow bands, long windows are used, and for
wide bands, short windows are used. The filter bank in Figure 3.8 is shown as a 
generalized block since it may take the form of a discrete wavelet transform, an 
adaptive wavelet packet, a pyramid structure, or a nonsubsampled filter bank. 
These are discussed in turn in the following sections. Noting the similarity of 
this structure to that of Figure 2.6, these methods based on filter banks can be 
interpreted in some sense as multiresolution phase vocoders. 

Note that the methods to be discussed generally involve octave-band filter- 
ing, which is perceptually reasonable since the auditory system exhibits roughly 
constant-Q resolution [153]. Such octave-band filtering is useful with regard to
the pre-echo problem. As shown in Section 2.6, the pre-echo in the sinusoidal
model depends on the window length; by using smaller windows for higher
frequencies, the pre-echo duration becomes inversely proportional to frequency
in these filter bank methods. This scaling is psychoacoustically viable in that
perception
of pre-echo is seemingly dependent on frequency; for a given partial, the percept 
depends not on the absolute length of the pre-echo but rather on how many 
periods of the partial occur in the pre-echo [131]. Using that principle, pre-echo 
distortion can be alleviated by using long frames for low-frequency partials and 
short frames for high-frequency partials. 
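
The principle can be made concrete with a one-line calculation. The sketch
below assumes that the pre-echo spans roughly one frame and that the sampling
rate is 44.1 kHz; both are assumptions of this example, and the function name
is hypothetical.

```python
def periods_in_frame(freq_hz, frame_len, sample_rate=44100.0):
    """Number of periods of a partial at freq_hz spanned by one frame."""
    return freq_hz * frame_len / sample_rate
```

With a fixed frame of 1024 samples, a 100 Hz partial completes only a few
periods within the frame whereas an 8 kHz partial completes well over a
hundred. Halving the frame length for each octave keeps the count constant: a
1 kHz partial in a 1024-sample frame and a 2 kHz partial in a 512-sample frame
span the same number of periods.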
3.3.1 Multirate Schemes: Wavelets and Pyramids 


Multirate systems are effective for dividing signals into subbands with low com- 
plexity and, in the critically sampled case, without increasing the amount of 
data in the representation. However, the analysis filtering process generally 
introduces aliasing, so the synthesis must incorporate aliasing cancellation to 
achieve a reasonable signal reconstruction. This aliasing leads to difficulties in 
the wavelet case that can be resolved by using a pyramid structure [131, 132]; 
in Section 3.3.2, such issues are circumvented by using a nonsubsampled filter 
bank. 


Wavelets. Sinusoidal modeling based on wavelet filter banks can be carried 
out in several ways. One approach is to model the downsampled subband sig- 
nals, carry out a sinusoidal reconstruction of each subband at the downsampled 
rate, and use a wavelet synthesis filter bank to construct the full-rate signal. 
The same frame length is used in each subband. Then, because the lowpass 
band has the lowest sample rate, the lowpass frames have the longest effective 
time support; by the same token, the frames in the highpass band have the 
shortest time support. This modeling method thus results in a parametric sig- 
nal representation with the multiresolution properties of the discrete wavelet 
transform. However, it has difficulties because the sinusoidal model does not 
provide perfect reconstruction; aliasing cancellation is not guaranteed in the 
synthesis filter bank because the subbands are modified in the modeling pro- 
cess. This difficulty can be circumvented by reconstructing the output from 
the subband models without using the synthesis filter bank; the full-rate re- 
construction is derived directly from the models of the downsampled subbands 
[202]. In this variation, it is necessary to explicitly account for aliasing in the 
sinusoidal parameter estimation; aliasing cancellation is incorporated into the 
estimation of the subband spectral peaks, but this typically accounts for only 
the aliasing between adjacent bands [8]. This method has reportedly proven 
useful for speech coding and time-scaling [8, 202]. An earlier hybrid algorithm 
involving wavelet-like filtering and sinusoidal subband modeling was reported 
in [53] for the application of source separation; in that case, the filter bank is 
oversampled in order to reduce the aliasing limitations. 


Wavelet packets. In the approaches discussed above, the subbands of a 
wavelet filter bank are represented with the sinusoidal model to allow for mod- 
ifications and processing. Such techniques can be conceptually generalized to 
the case of adaptive wavelet packets, where the tree-structured filter bank is 
varied in time according to the signal behavior; heuristically, the adaptation 
can be interpreted as follows: during transient behavior, the filter bank has 
short impulse responses to track the time-domain changes, and during station- 
ary behavior the impulse responses are lengthened to improve the frequency 
resolution. Such wavelet packet vocoders have not been formally considered in 
the literature.


Pyramid structures. Octave-band filtering without subband aliasing can 
be carried out using a pyramid structure [66]. As in the pyramid structure 
of Figure 3.7, the subband representation is oversampled by a factor of two 
(asymptotically); here, the overcomplete representation provides an improve- 
ment over the critically sampled case in that the subbands are free of aliasing. 
This filter bank has recently been proposed as a front end for multiresolution 
sinusoidal modeling. The resulting algorithm has been shown to be effective 
for modeling a wide range of audio signals [131, 132]. 


3.3.2 Nonsubsampled Filter Banks 


In multirate filter banks, perfect reconstruction requires aliasing cancellation. 
In other words, there is inherently some degree of aliasing in the subband 
signals that is cancelled by the synthesis filter bank. This cancellation is a very 
exacting process; if an approximate representation such as the sinusoidal model 
(or even quantization) is applied in the subbands prior to synthesis, aliasing 
cancellation in the reconstruction is not guaranteed. 

The methods discussed above use various approaches to overcome aliasing 
problems. These issues do not arise, however, if a nonsubsampled filter bank is 
used to split the input signals into the requisite bands. Such filter banks satisfy 
the perfect reconstruction constraint 


$$ \sum_q x_q[n] = x[n], \tag{3.14} $$


meaning that there is no aliasing or distortion introduced in the subband sig- 
nals. The design of nonsubsampled filter banks that meet this constraint is 
straightforward; the design process is discussed explicitly in Section 4.3.1. A 
decomposition in terms of alias-free subbands that meet the condition given in 
Equation (3.14) can indeed be arrived at using a nonsubsampled wavelet filter 
bank; the design method in Section 4.3.1, however, allows for more flexible 
spectral decompositions than the octave-band model derived by a wavelet filter 
bank. 

In the multirate filter banks previously discussed, the subbands have differ- 
ent sampling rates. Then, a window of some fixed length can be applied in the 
subbands; with respect to the original sampling rate, the window in the low- 
pass band has the longest time support and the window in the highpass band 
has the shortest time support. The multiresolution in that case is provided 
by the multiplicity of sampling rates. In the case of a nonsubsampled filter 
bank, multiresolution is achieved by using windows of different lengths in the 
subbands. This approach is depicted in a heuristic sense in Figure 3.9 for the 
case of a nonsubsampled octave-band filter bank. 
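
One simple way to obtain subband signals that sum exactly to the input is to
take differences between successively smoother versions of the signal. The
sketch below is only an illustration of the constraint in Equation (3.14); it
is not the design method of Section 4.3.1, the moving-average filters give only
a crude band splitting, and the function names and filter widths are
assumptions of this example.

```python
def moving_average(x, width):
    """Centered moving average with zero padding; width should be odd."""
    half = width // 2
    n = len(x)
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out.append(sum(x[lo:hi]) / width)
    return out


def complementary_bands(x, widths=(3, 7, 15)):
    """Split x into len(widths) + 1 full-rate bands that sum to x.

    Each band is the difference between two successively smoother
    versions of x; the last band is the smoothest (lowpass) version."""
    bands = []
    current = list(x)
    for w in widths:
        smooth = moving_average(x, w)
        bands.append([a - b for a, b in zip(current, smooth)])
        current = smooth
    bands.append(current)
    return bands
```

Because the differences telescope, the bands add up to the input regardless of
the lowpass filters chosen, and no aliasing cancellation is involved. In a
multiresolution model, each band would then be analyzed with its own window
length, with the shortest windows applied to the highest band.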

Nonsubsampled filter banks are subject to much looser design constraints 
than multirate filter banks; this advantage arises because no aliasing cancel- 
lation is required. However, nonsubsampled filter banks have a disadvantage 
with respect to multirate structures in that more computation is required to 
perform the filtering. Furthermore, in the nonsubsampled filter banks designed
according to the method of Section 4.3.1, all of the filters in the filter bank
are required to be of the same length; this supports the contention that
multirate structures are more appropriate for multiresolution analysis.
However, this is a somewhat inappropriate conclusion for the application at
hand; as long as the filter bank impulse responses are of shorter duration than
the sinusoidal analysis windows, the time resolution is limited by the subband
sinusoidal models and not by the filter bank. Again, note that in the multirate
structures the same window and stride can be used in each of the subbands; the
multiresolution in those cases results from the fact that the subbands have
different sampling rates. In nonsubsampled filter banks, multiresolution is
achieved by choosing different window sizes and strides in the various
subbands.


Figure 3.9. Multiresolution sinusoidal modeling using a nonsubsampled filter
bank. The filter bank provides an octave-band decomposition; the sinusoidal
models have frame sizes scaled by powers of two according to the width of the
respective subband. As described in the text, it is straightforward to design
filter banks that derive other decompositions but it is not feasible to
optimize the filter bank and the sinusoidal models for modeling arbitrary
signals.

For a multiresolution sinusoidal model based on a filter bank, optimal design 
is prohibited by the large number of design parameters. The performance is 
influenced in complicated ways by the choices of filter band edges and frequency 
response properties as well as the parameters of the subband sinusoidal models 
(the number of partials, the window sizes, and the analysis strides). While 
heuristic designs can lead to modeling improvements as shown in Figure 3.10, 
a given design is not necessarily ideal for arbitrary signals. In a sense, if
the filters and subband models are fixed, the problem is again a lack of signal
adaptivity; the approach is rigid and can thus break down for some signals.
In the next section, a signal-adaptive multiresolution framework based on time
segmentation is considered.


Figure 3.10. Multiresolution sinusoidal modeling using a nonsubsampled filter
bank. The original signal in (a) is the onset of a saxophone note. Plot (b) is
a sinusoidal reconstruction using a fixed frame size of 1024; plot (c) is the
residual for that case. The plot in (d) shows a reconstruction based on
sinusoidal modeling of the subbands of a nonsubsampled 7-band octave filter
bank. Ranging from the lowest to the highest band, the subband sinusoidal
models use synthesis frame sizes of 1024, 768, 512, 512, 256, 256, and 256.
Plot (e) shows the residual for the filter bank case.


3.4 ADAPTIVE SEGMENTATION 


This section considers algorithms for deriving signal models based on adap- 
tive time segmentation. The idea is to allow segments of variable length in a 
model so that appropriate time-frequency localization tradeoffs can be applied 
in various regions of the signal; this same idea of adaptive time resolution mo- 
tivates the use of window switching in modern audio coding filter banks, but in 
the segmentation approaches to be discussed the resolution tradeoffs are more 
flexible than in typical signal-adaptive filter banks. 

A signal-adaptive segmentation can be arrived at by an exhaustive global
search, by a dynamic program, or by a heuristic approach. These three methods
are discussed in this section; the focus is placed on dynamic programs
for segmentation, which can arrive at optimal models with substantially less
computation than a global search.


3.4.1 Dynamic Segmentation 


Given an entire signal and arbitrary allowances for intensive off-line computa- 
tion, an optimal segmentation with respect to some modeling metric can be 
derived by a globally exhaustive search. If the metric is additive and inde- 
pendent across segments, however, the computational cost can be substantially 
reduced using a dynamic program. This approach has been applied to wavelet 
packets and linear predictive coding [102, 174, 191]; after a brief review of 
dynamic programming and the relevant literature, dynamic segmentation for 
sinusoidal modeling is considered. 


Dynamic programming. Dynamic programming was first introduced for 
solving minimum path-length problems [15]. The notion is that the computa- 
tional cost of some problems can be reduced by solving the problems in sequen- 
tial stages; redundant computation is avoided by phrasing a global decision in 
terms of successive local decisions. This type of approach has found widespread 
use for sequence detection in digital communication, where it is referred to as 
the Viterbi algorithm [126]. Similar ideas play a role in hidden Markov
modeling, which is central to many speech recognition systems [189, 243].
The dynamic programming method can be outlined as follows [23]: 


• Consider the choice of a solution as a sequence of decisions. 


• Incorporate a metric for the decisions such that the metric for the overall 
solution is the sum of the metrics for the individual sequential decisions. 


• Assuming that some of the necessary decisions have been made, determine 
which decisions must be considered next and evaluate the metric for those 
decisions. 


• Starting at the point where no decisions have been made, carry out a
recursion to determine the set of decisions that are optimal according to the
additive metric. 


This description is rather general since the dynamic programming approach is 
itself quite general. The issues at hand are further clarified in the following dis- 
cussion of the application of dynamic programming to signal segmentation and 
modeling; also, the computational efficiency afforded by dynamic programming 
is quantified. 


Notation and problem statement. A mathematical treatment of the seg- 
mentation problem requires the introduction of some new notation; this is given 
here along with various assumptions about the signal and the computation
requirements for modeling. First, there is some smallest segment size
$\epsilon$ for the signal segmentation. Segments of length $\epsilon$ will be
referred to as cells, and it will be assumed that the signal is $N$ cells long,
i.e. the signal is of length $N\epsilon$. For general signal modeling, it is of
interest to have a very flexible set of segment lengths to choose from; the
set, which will be denoted by $\Lambda$, is assumed to consist of consecutive
integer multiples of the cell size:


$$ \Lambda = \{\epsilon,\, 2\epsilon,\, 3\epsilon,\, \ldots,\, L\epsilon\}. \tag{3.15} $$


A particular element from such a set of segment lengths will be denoted by
$\lambda$.

Two specific cases will be considered in the treatment of computational 
cost. The first case is L = N, which implies that the implementation has no 
memory restrictions; for a signal of arbitrary length, the algorithm is capable 
of computing a model on a segment covering the span of the entire signal. The 
second case is L < N (and sometimes L << N), which corresponds to the case 
of an implementation with finite memory. This restriction on L is somewhat 
analogous to the truncation depth commonly used to reduce the delay in Viterbi 
sequence detection [126]. 

Using a diverse set of segment lengths allows for flexibility in signal modeling. 
Additional signal adaptivity can be achieved by allowing for a choice of model 
for each segment. One example of such a model choice is the filter order in 
a linear prediction application [174]. In the sinusoidal modeling case to be 
discussed, there is not a multiplicity of candidate models for each segment, so 
this issue is not considered here. Note that if the evaluations of each model on 
a given segment require the same amount of computation, allowing for a choice 
of model does not affect the computation comparisons to be given. 

The problem of signal modeling with adaptive segmentation is simply that 
of choosing an appropriate set of disjoint segments that cover the signal. The 
segmentation is chosen so as to optimize some metric; for proper operation of 
the dynamic program, it is required that the metric be independent and additive 
on disjoint segments. Then, the total metric for a segmentation $\sigma$
composed of segments $\lambda_i$ can be expressed as a sum of the metrics on
the constituent segments:


$$D(\sigma) = \sum_i D(\lambda_i), \tag{3.16}$$


where $i$ is a segment index and where the constituent disjoint segments of the
segmentation $\sigma$ satisfy


$$N\epsilon = \sum_i \lambda_i \tag{3.17}$$


for a signal of length $N\epsilon$. Mean-square error and rate-distortion metrics can
be applied in this framework [102, 191, 174]. 
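The additivity requirement of Equations (3.16) and (3.17) can be illustrated with a minimal Python sketch. The representation is an assumption made here for illustration and is not taken from the text: a segmentation is a list of segment lengths measured in cells, and `metric(a, b)` is a hypothetical caller-supplied function that returns the metric for the segment between markers `a` and `b`.

```python
def total_metric(lengths, n_cells, metric):
    """Evaluate the additive metric of a segmentation.

    lengths : segment lengths in cells (hypothetical representation)
    n_cells : N, the signal length in cells
    metric  : hypothetical callable metric(a, b) for the segment between
              markers a and b; it must depend on that segment only
    """
    # Coverage condition, Equation (3.17): the segments must tile the signal.
    if sum(lengths) != n_cells:
        raise ValueError("segments must be disjoint and cover all N cells")
    total = 0.0
    start = 0
    for length in lengths:
        # Additivity, Equation (3.16): add the metric of one segment at a time.
        total += metric(start, start + length)
        start += length
    return total
```

Any metric that can be written in this form, with each term depending only on its own segment, is a candidate for the dynamic program described later in this section.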


Computational cost of global search. The globally optimal segmentation
is simply the segmentation which minimizes the metric $D(\sigma)$. Obviously,
this minimization can be arrived at by a globally exhaustive search in which
the metric is computed for every possible segmentation in turn. The brief 
consideration here indicates that this exhaustive approach is computationally 
prohibitive for long signals. This difficulty motivates formulating the metric 
computation as a dynamic program. 

In a globally exhaustive search, a model must be evaluated on each segment 
in each possible segmentation. Assuming that the cost of model evaluation 
is independent and additive on disjoint segments, a simple estimate of the 
computational cost of a global search can be arrived at by counting the total 
number of segments in all of the possible segmentations. This measure assumes 
that the cost of model evaluation on a segment is independent of the segment 
length, which is admittedly a somewhat unrealistic assumption; for example, 
an FFT of a segment of length $\lambda$ requires on the order of
$\lambda \log \lambda$ multiplies. The computational costs for other types of
models are generally dependent on the segment length as well, so this
enumeration of segments is by no means a formal cost measure but rather a basic
feasibility indicator.

As motivated above, the computational cost of the global segmentation algo- 
rithm can be quantified by counting the total number of segments in all of the 
possible segmentations. For the case L = N, this enumeration can be derived 
by simple combinatorics. Noting that there are $N - 1$ cell boundaries in the
interior of the signal and that each of these can be independently chosen as
a segment boundary in the signal segmentation, there are $2^{N-1}$ possible
segmentations; furthermore, the average number of segments in a segmentation
is $(N + 1)/2$. The total number of segments in all of the possible
segmentations is given by


$$C = [\text{number of segmentations}] \times [\text{number of segments per segmentation}], \tag{3.18}$$


so the cost of global search for the case $L = N$ is

$$C_{L=N} = 2^{N-2}(N + 1), \tag{3.19}$$

which is governed by an exponential dependence on the signal length:

$$C_{L=N} \propto 2^N. \tag{3.20}$$


In the truncated case $L < N$, the segment count does not have a simple
formulation as in the unrestricted case. It can be shown, however, that the
total number of segments is still governed by an exponential dependence on the
signal length.¹ In either of these cases, the exponential dependence on the
signal length prohibits model evaluation via exhaustive computation.
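For small signals, the counts above can be checked by brute force. The following Python sketch enumerates every segmentation of $N$ cells with segment lengths of at most $L$ cells and tallies the number of segmentations and the total number of segments; the function names are illustrative only.

```python
def enumerate_segmentations(n_cells, max_len):
    """Yield each segmentation of n_cells cells as a tuple of segment
    lengths (in cells), with every length between 1 and max_len."""
    if n_cells == 0:
        yield ()
        return
    for first in range(1, min(max_len, n_cells) + 1):
        for rest in enumerate_segmentations(n_cells - first, max_len):
            yield (first,) + rest


def exhaustive_cost(n_cells, max_len):
    """Return (number of segmentations, total number of segments)."""
    count = 0
    total_segments = 0
    for segmentation in enumerate_segmentations(n_cells, max_len):
        count += 1
        total_segments += len(segmentation)
    return count, total_segments


# Unrestricted case L = N with N = 6: 2^(N-1) = 32 segmentations and
# 2^(N-2) (N + 1) = 112 segments in total, as in Equation (3.19).
print(exhaustive_cost(6, 6))   # (32, 112)
```

The enumeration itself takes exponential time, which is exactly the difficulty at issue; it is useful only as a check for small $N$.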


¹ For $L < N$, the number of possible segmentations is given by the $N$-th
term of an $L$-th order Fibonacci series; this $N$-th term has an exponential
dependence on $N$. Following the framework of Equation (3.18), the total number
of segments in all of the possible segmentations is then given roughly by the
product of this exponential term and the signal length.

The next section describes a dynamic program that can derive the same
optimal segmentation as an exhaustive search, but with a cost that is governed
by a quadratic dependence on the signal length for the case $L = N$ and a
linear dependence for $L \ll N$. This cost reduction is achieved by removing
redundant computation; the basic insight in the dynamic programming approach is
that though some segment $\lambda$ is a component of many distinct
segmentations, it is not necessary to calculate $D(\lambda)$ for each such
occurrence. A dynamic program
provides a computational framework in which the cost of evaluating a model 
on a given segment is incurred only once. 


Reduction of computational cost via dynamic programming. The 
first step in a dynamic approach to signal segmentation is to consider the time 
span of the signal as a concatenation of cells. The boundaries between cells will 
be referred to as markers; because of the integer-multiple construction of the 
allowable segment lengths, the boundaries in any valid segmentation will align 
with some of these markers, so they can effectively be used as indices. In the 
dynamic program, each marker is treated as a possible segment boundary for 
the signal segmentation; the markers serve as nodes in the dynamic program. 

Without loss of generality, the algorithm will be explained in terms of the
examples shown in Figures 3.11 and 3.12, which correspond to the cases
$L = N$ and $L < N$, respectively. In the figures, $D_{ab}$ represents the
distortion metric associated with the segment of length $(b - a)\epsilon$
between markers $a$ and $b$. Further notation required for the explanation is
as follows. At any marker $a$, the dynamic algorithm has determined the
segmentation that leads to the minimum distortion up to that marker. This
partial segmentation will be denoted by $\sigma_a$ and the corresponding
distortion will be denoted by $D(\sigma_a)$; this distortion is the minimum
modeling metric achievable for segmenting the signal up to the $a$-th marker.
The term $\lambda_a$ will be used to denote the length of the last segment in
the segmentation $\sigma_a$ that achieves the minimum metric $D(\sigma_a)$; the
algorithm stores this value at each marker so that the optimal segmentation can
be recovered by backtracking after the end of the signal is reached.

Using the notation established above, the steps of the algorithm in the case
$L = N$ are as follows; this corresponds to the illustration in Figure 3.11:


■ Evaluate $D_{01}$, the modeling metric for the cell between markers 0 and 1,
and store the result as $D(\sigma_1)$.

■ Evaluate $D_{12}$ and $D_{02}$.

■ Find $D(\sigma_2) = \min\{D_{02}, D(\sigma_1) + D_{12}\}$. This minimum
indicates the best segmentation $\sigma_2$ between markers 0 and 2.

■ Store $D(\sigma_2)$ and $\lambda_2$, the length of the last segment in
$\sigma_2$.

■ Evaluate $D_{03}$, $D_{13}$, and $D_{23}$.

■ Find $D(\sigma_3) = \min\{D_{03}, D_{13} + D(\sigma_1), D_{23} + D(\sigma_2)\}$.
This minimum indicates the best segmentation $\sigma_3$ between markers 0 and 3.

■ Store $D(\sigma_3)$ and $\lambda_3$.

■ Evaluate $D_{34}$, $D_{24}$, $D_{14}$, and $D_{04}$.

■ Find $D(\sigma_4) = \min\{D_{04}, D_{14} + D(\sigma_1), D_{24} + D(\sigma_2), D_{34} + D(\sigma_3)\}$.

■ Store $D(\sigma_4)$ and $\lambda_4$.

■ Continue in this manner until the end of the signal is reached; note that each
successive marker introduces a larger number of new candidate segments for
consideration. The minimum $D(\sigma_N)$ calculated at the last marker is the
globally optimal metric; as mentioned earlier, the optimal segmentation
$\sigma_N$ can be found by backtracking through the recorded segment lengths.


Note that to determine the segmentation that yields the minimum metric, it is 
necessary to store the appropriate segment length at each marker. The mini- 
mum metric itself, however, can be computed without storing path information. 
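The recursion and backtracking steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: `metric(a, b)` is a hypothetical caller-supplied function returning the metric for the segment between markers `a` and `b`, and the toy signal and per-segment penalty in the usage lines are invented for the example (the penalty stands in for a rate term that favors longer segments). The optional `max_len` argument covers the truncated case in which segments are limited to `L` cells.

```python
def dynamic_segmentation(n_cells, metric, max_len=None):
    """Minimize an additive metric over all segmentations of n_cells cells.

    metric(a, b) : hypothetical callable for the segment between markers a, b
    max_len      : longest allowed segment in cells (defaults to n_cells)
    Returns (minimum total metric, list of segment lengths in cells).
    """
    if max_len is None:
        max_len = n_cells
    best = [0.0] * (n_cells + 1)     # best[b]: minimum metric up to marker b
    last_len = [0] * (n_cells + 1)   # last_len[b]: last segment length at b
    for b in range(1, n_cells + 1):
        best[b] = float("inf")
        for a in range(max(0, b - max_len), b):
            total = best[a] + metric(a, b)   # best path to a, then segment a..b
            if total < best[b]:
                best[b] = total
                last_len[b] = b - a
    # Recover the segmentation by backtracking through the stored lengths.
    lengths = []
    b = n_cells
    while b > 0:
        lengths.append(last_len[b])
        b -= last_len[b]
    lengths.reverse()
    return best[n_cells], lengths


# Toy usage: one sample per cell, a piecewise-constant "model" per segment.
signal = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 1.0, 1.0]

def toy_metric(a, b, penalty=0.5):
    seg = signal[a:b]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg) + penalty

cost, lengths = dynamic_segmentation(len(signal), toy_metric)
print(cost, lengths)   # 1.5 [4, 2, 2]
```

Each segment metric is evaluated exactly once, inside the inner loop; the list `last_len` is needed only if the segmentation itself, and not just the minimum metric, is to be recovered.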

The computational cost of the algorithm described above, namely an enu- 
meration of the number of segments on which models are evaluated, can be 
easily determined by considering Figure 3.11. The number of candidate seg- 
ments that must be evaluated at each marker is equal to the value of the marker 
index, so the cost is simply 


$$\bar{C}_{L=N} = 1 + 2 + 3 + \cdots + N \tag{3.21}$$

$$= \frac{1}{2}(N^2 + N), \tag{3.22}$$


where the bar is included in the notation $\bar{C}$ to specify that the cost
corresponds to a dynamic algorithm. Noting the dominant term in the above
expression, the cost of dynamic segmentation with $L = N$ can be summarized as:


$$\bar{C}_{L=N} \propto N^2. \tag{3.23}$$

This quadratic dependence on the signal length is a considerable improvement 
over the exponential dependence of an exhaustive global search. 

For the case L < N, depicted in Figure 3.12, the steps in the algorithm are 
the same as above, with the exception of the later stages where the bounded 
segment length comes into effect: 


■ Evaluate $D_{01}$ and store the result as $D(\sigma_1)$.

■ Evaluate $D_{12}$ and $D_{02}$.

■ Find $D(\sigma_2) = \min\{D_{02}, D(\sigma_1) + D_{12}\}$.

■ Store $D(\sigma_2)$ and $\lambda_2$.

■ Evaluate $D_{03}$, $D_{13}$, and $D_{23}$.

■ Find $D(\sigma_3) = \min\{D_{03}, D_{13} + D(\sigma_1), D_{23} + D(\sigma_2)\}$.

■ Store $D(\sigma_3)$ and $\lambda_3$.

■ Evaluate $D_{34}$, $D_{24}$, and $D_{14}$.

■ Find $D(\sigma_4) = \min\{D_{14} + D(\sigma_1), D_{24} + D(\sigma_2), D_{34} + D(\sigma_3)\}$.

■ Store $D(\sigma_4)$ and $\lambda_4$.

■ Continue in this manner until the end of the signal is reached; note that after
marker $L$, each additional marker introduces a fixed number of candidate
segments, namely $L$. The minimum $D(\sigma_N)$ calculated at the last marker is
the globally optimal metric for this case; the optimal segmentation $\sigma_N$
can be found by backtracking through the recorded segment lengths.


Figure 3.11. A depiction of a dynamic segmentation algorithm for the case L = N,
where the segment lengths are not restricted. As derived in the text, the computational cost
of the algorithm grows quadratically with the length of the signal in this case.


The computational cost of the truncated approach can be derived by consider- 
ing Figure 3.12, which indicates that the algorithm has a repetitive structure 
after the startup. The number of segments on which models are evaluated is
given by

$$\bar{C}_{L<N} = \underbrace{1 + 2 + \cdots + (L - 1)}_{\text{startup}} + (N - L + 1)L \tag{3.24}$$

$$= NL - \frac{1}{2}(L^2 - L). \tag{3.25}$$


The cost of dynamic segmentation with $L < N$ can thus be summarized as

$$\bar{C}_{L<N} \propto N, \tag{3.26}$$


where the omission of the terms involving L is particularly valid for cases where 
L << N, i.e. processing of arbitrarily long signals. For instance, in high-quality 
modeling of music it is necessary to have L << N due to computational and 
memory limitations. Furthermore, it is sensible to restrict the segment lengths 
since music is nonstationary in a global sense; it is unreasonable to assume that 
a one-segment model could describe an entire signal, so the candidate segment 
lengths can be justifiably bounded by some finite duration for which there is a 
possibility of local stationarity. In such cases, the cost grows linearly with the 
signal length, which is an improvement over both the global case of Equation 
(3.20) and the dynamic approach with unrestricted segment lengths described 
in Equation (3.23). 
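The three cost measures can be compared numerically. The short Python sketch below counts the segments evaluated by the dynamic program directly from its loop structure and compares the result with the closed forms of Equations (3.19), (3.22), and (3.25); the function names and the values $N = 20$ and $L = 4$ are chosen here for illustration only.

```python
def dp_segment_evaluations(n_cells, max_len):
    """Segments evaluated by the dynamic program: at marker b there is one
    candidate for each allowed last-segment length, i.e. min(b, L)."""
    return sum(min(b, max_len) for b in range(1, n_cells + 1))


def exhaustive_segment_evaluations(n_cells):
    """Closed form of Equation (3.19): 2^(N-2) (N + 1), in integer arithmetic."""
    return (n_cells + 1) * 2 ** (n_cells - 1) // 2


N, L = 20, 4
print(exhaustive_segment_evaluations(N))   # 5505024
print(dp_segment_evaluations(N, N))        # 210, equal to (N^2 + N) / 2
print(dp_segment_evaluations(N, L))        # 74, equal to N L - (L^2 - L) / 2
```

Even for a signal of only 20 cells, the exhaustive search evaluates several million segments, whereas the dynamic program evaluates a few hundred at most.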

Applications of dynamic segmentation are discussed in the following; adap- 
tive wavelet packets, linear predictive coding, and sinusoidal modeling can all 
be carried out in this framework. One caveat to note, however, is that in some 
of these methods it is necessary to use overlapping segments to ensure signal 
continuity at the synthesis frame boundaries. In such cases, the algorithm is 
not guaranteed to find the globally optimal segmentation; in practice, however, 
the effect is negligible, so the dynamic segmentations can be justifiably referred 
to as optimal [191]. A further issue to note is that the dynamic segmentation 
method, as described, considers the entire signal before a final decision is made 
regarding the segmentation; in this form, it is only suitable for off-line compu- 
tation. In applications such as voice coding for telephony, it is of more interest 
to process the signal in blocks that can be transmitted sequentially. Dynamic 
segmentation can be applied in such scenarios by monitoring the candidate 
segmentation. This is based on the principle that the segmentation choices 
in signal regions that are distant in time are generally independent; without 
significantly sacrificing the optimality, then, the algorithm can be periodically 
terminated to derive blocks for coding [174]. 


Adaptive wavelet packets. Early applications of dynamic programming 
to signal modeling involved models based on wavelet packets. In [191], the 
best wavelet packet in a rate-distortion sense is chosen for the model for each 
segment; in [102], dynamic segmentation is added to allow for localization of 
transients. A similar technique was considered in [247]. 


Figure 3.12. A depiction of a dynamic algorithm for signal segmentation for the case
L < N, where the segment lengths are restricted. Note the regularity of the recursion after
the startup; the cost of this algorithm grows linearly with the length of the signal.


Arbitrary models. In addition to the wavelet packet algorithms described,
dynamic segmentation and model selection have been applied to image
compression [192] and linear predictive coding of speech [174]. As long as the optimality
metric is independent and additive across disjoint frames, the dynamic program 
can be used to efficiently find the optimal segmentation and model selections. In 
cases where discontinuities across frame boundaries may be objectionable, the 
candidate models tend to have dependencies on adjacent frames; for instance, 
in the image processing application, where discontinuities result in blockiness, 
the candidate models are lapped orthogonal transforms which reduce the block- 
ing artifacts incurred due to quantization [142, 192]. Because of the overlap, 
the dynamic algorithm is not guaranteed to find the globally optimal model, 
but in practice this effect is negligible. In the sinusoidal model application, the 
dynamic algorithm is again possibly suboptimal but this suboptimality turns 
out to be largely irrelevant. 


Sinusoidal modeling. As seen in Section 2.6, a sinusoidal model with a fixed 
frame size results in delocalization of time-domain transients if the frames are 
too long. This delocalization can be interpreted in terms of the synthesis: the
signal is reconstructed in each synthesis frame as a sum of linear-amplitude,
cubic-phase sinusoids, each of which has the same time support, namely the 
synthesis frame size; this fixed time support results in a smearing of signal 
features across the frame. In addition to this delocalization within each frame, 
features are spread across neighboring frames by the line tracking and parame- 
ter interpolation operations. One consequence of this is the pre-echo distortion 
discussed in Section 2.6; the example from Figure 2.21 is repeated here in Figure 
3.13 for the sake of comparison with the improved models to be considered. 


Time-domain delocalization, e.g. pre-echo distortion, results from the use of 
frames that are too long. For frames that are too short, a similar delocalization 
occurs in the frequency domain; frequency resolution is limited in short frames. 
For modeling arbitrary signals, then, it is of interest to trade off time and 
frequency resolution by selecting frame sizes according to the signal’s behavior. 
This tradeoff can be carried out via dynamic segmentation based on an accuracy 
metric [85]. A viable approach is to choose the metric $D(\lambda)$ to be the
mean-square error of the reconstruction over the segment $\lambda$, which is a
reasonable measure of modeling accuracy. When the mean-square error is used as
the adaptation metric, the dynamic algorithm chooses short frames near attacks
to improve time localization and longer frames when the signal exhibits
stationary behavior, i.e. pseudo-periodicity, since the improved frequency
resolution in long frames leads to more accurate modeling in such regions.


The basic advantage of adaptive segmentation in the sinusoidal model is 
that the time support of the constituent linear-amplitude cubic-phase sinusoidal 
atoms is adapted so that localized signal features are accurately represented. 
An example of the pre-echo reduction in such a multiresolution model is given 
in Figure 3.13. In the implementation, the same number of partials is used in 
models of short frames and long frames in order to simplify the line tracking. 
Because of this constant model order, using long frames improves the coding 
efficiency; also, rate considerations can be readily incorporated by scaling the 
metric so as to favor longer frames, but this will not be dealt with here. 


It was mentioned earlier that the dynamic algorithm is not guaranteed to 
find the optimal segmentation if the models in adjacent frames are dependent, 
but that such dependence is indeed required in some cases to prevent disconti- 
nuities in the synthesis. This scenario applies in the case of sinusoidal modeling. 
In the static case, the synthesis frames are demarcated by the centers of the 
analysis frames. There is thus an intrinsic overlap in the modeling process 
as depicted in Figure 3.14. A similar overlap appears in the case of dynamic 
segmentation; as a result, the segmentation is not guaranteed to be optimal. 
The deviation from optimality, however, is basically negligible; the algorithm 
still carries out the intended task of finding appropriate tradeoffs in time and 
frequency resolution for modeling arbitrary signals. Figure 3.14 also indicates 
an important implementation issue, namely that a given segmentation requires 
a specific set of analysis windows to cover the signal. Each candidate segmen- 
tation thus has its own set of sinusoidal analysis results. These various analyses 
can be managed efficiently in the dynamic algorithm. Finally, note that the
analysis windows, as depicted in Figure 3.14, need not satisfy the overlap-add
condition. This design flexibility results from the incorporation of a parametric
representation and applies to the fixed-resolution sinusoidal model as well.


Figure 3.13. Comparison of residuals for a fixed-frame sinusoidal model and an adaptive
multiresolution model based on dynamic segmentation. The original signal (a) is a saxophone
note. Plot (b) is a reconstruction based on a fixed frame size of 1024 and (c) is the
residual for that case; the dotted lines indicate the synthesis frame boundaries. Plot (d) is
a reconstruction using dynamic segmentation with frame sizes 512, 1024, 1536, and 2048;
the segmentation arrived at by the dynamic algorithm is indicated by the dotted lines in the
plot of the residual (e). In the dynamic model, the attack is well-localized and does not
contribute extensively to the residual.


3.4.2 Heuristic Segmentation 


It is common in the development of signal processing algorithms to first inves- 
tigate optimal or nearly optimal algorithms and then compare the results with 
lower cost methods based on less stringent metrics. In the framework of signal 
segmentation, this is tantamount to considering simple forward segmentation 
based on the local modeling error rather than focusing on global optimality. 
While the global segmentation is an analysis-by-synthesis approach that in- 
volves the entire signal, the forward segmentation is an analysis-by-synthesis 
that simply chooses among the candidate segments at each marker.


Figure 3.14. Analysis and synthesis frames in fixed-resolution and multiresolution
sinusoidal models. This plot is included to indicate the overlap of the analysis frames. In the
dynamic segmentation algorithm, this overlap undermines the required independence of the
segment metrics; as a result, the synthesis segmentation derived by a dynamic program
is not guaranteed to be globally optimal. This suboptimality is generally inconsequential,
however.


A simple algorithm for forward segmentation. In the sinusoidal model,
a heuristic segmentation approach can achieve results similar to those of the
dynamic algorithm for the example of Figure 3.13. The simple algorithm is as
follows, where the signal segmentation is again described in terms of markers:


■ At marker $a$, evaluate the metric $D_{ab}/(b - a)$ for
$b \in \{a+1, a+2, a+3, \ldots, a+L\}$, where the set corresponds to the
candidate segment lengths.

■ Find the marker $b$ which minimizes the weighted metric and advance to that
marker.

■ Set a new starting point at $a = b$ and repeat the preceding steps.


Note that in this algorithm the segmentation decisions are made based on local 
minimization of the distortion metric; since local minima are pursued greedily, 
global optimality of the metric is not guaranteed. Of course, many variations of 
forward segmentation can be formulated; for instance, by incorporating some 
dependence on neighboring results, a more global solution can be targeted. 
Such variations will not be considered, however; the intent is merely to draw a 
comparison between dynamic and heuristic segmentation methods. 
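The forward algorithm can be sketched in a few lines of Python. Two assumptions are made here and should be read as such: the weighted metric is taken to be the segment metric divided by the segment length in cells, and `metric(a, b)` is a hypothetical caller-supplied function for the segment between markers `a` and `b`. The toy signal and penalty term are invented for the example.

```python
def forward_segmentation(n_cells, metric, max_len):
    """Greedy forward segmentation: at each marker, pick the candidate end
    marker with the smallest length-weighted metric and advance to it."""
    lengths = []
    a = 0
    while a < n_cells:
        # Candidate end markers a+1 .. a+L, clipped at the end of the signal.
        candidates = range(a + 1, min(a + max_len, n_cells) + 1)
        b = min(candidates, key=lambda m: metric(a, m) / (m - a))
        lengths.append(b - a)
        a = b
    return lengths


# Toy usage: one sample per cell, a piecewise-constant "model" per segment.
signal = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 1.0, 1.0]

def toy_metric(a, b, penalty=0.5):
    seg = signal[a:b]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg) + penalty

print(forward_segmentation(len(signal), toy_metric, max_len=8))   # [4, 2, 2]
```

For this toy signal the greedy choice places boundaries at the two level changes; since each decision is made locally and never revisited, such agreement with a globally optimal segmentation should not be expected in general.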

Figure 3.15 shows an application of forward segmentation to a saxophone 
attack; for this example, the forward method achieves a model similar to that
of the dynamic algorithm, but such comparable performance is not guaranteed for all
signals. As will be shown in the next section, the forward segmentation requires 
less computation than the dynamic approach. In real-time (or limited-time) 
applications, then, the reduced cost of a forward segmentation method may 
merit this accompanying decrease in modeling accuracy. On the other hand, in 
off-line applications such as compression of images or audio for databases, it is 
more appropriate to use an optimal dynamic algorithm.


Figure 3.15. Comparison of residuals for a fixed-frame sinusoidal model and an adaptive
multiresolution model based on forward segmentation. The original signal (a) is a saxophone 
note. Plot (b) is a reconstruction based on a fixed frame size of 1024 and (c) is the 
residual for that case; the dotted lines indicate the synthesis frame boundaries. Plot (d) is 
a reconstruction using forward segmentation with frame sizes 512, 1024, 1536, and 2048; 
the segmentation arrived at is indicated by the dotted lines in the plot of the residual (e). 
In the forward adaptive model, the attack is well-localized. 


Cost of forward segmentation. In the heuristic segmentation algorithm 
described above, the number of markers visited depends on the signal; if a long 
frame is chosen, the algorithm advances to the end of the frame and skips over 
the markers in between. Thus, the computation required in the algorithm is 
signal-dependent. To quantify the computational cost, then, the worst case 
scenario is considered; the case in which every marker is visited provides an 
upper bound for the cost. For L = N, the number of segments considered at 
successive markers decreases as the algorithm advances toward the end of the 
signal; for the worst case, the cost is given by 


$$\tilde{C}_{L=N} = N + (N - 1) + (N - 2) + \cdots + 2 + 1 \tag{3.27}$$

$$= \frac{1}{2}(N^2 + N) \tag{3.28}$$

$$\Longrightarrow \tilde{C}_{L=N} \propto N^2, \tag{3.29}$$

where the tilde is included in the notation $\tilde{C}$ to specify that the cost
corresponds to a forward algorithm. For $L < N$, the worst case cost is given by


$$\tilde{C}_{L<N} = L + L + \cdots + L + \underbrace{(L - 1) + (L - 2) + \cdots + 2 + 1}_{\text{end of signal}} \tag{3.30}$$

$$= NL - \frac{1}{2}(L^2 - L) \tag{3.31}$$

$$\Longrightarrow \tilde{C}_{L<N} \propto N. \tag{3.32}$$


The costs here are identical to those evaluated for the dynamic algorithm; com- 
pare Equations (3.28) and (3.31) with Equations (3.22) and (3.25). In either 
case, the total number of segments considered in the worst case forward seg- 
mentation is the same as the number considered in the dynamic algorithm. For 
the truncated case, a more optimistic measure of the computation in the for- 
ward approach can be arrived at by an averaging argument. Assuming that the 
segment lengths are all equally reasonable for modeling, and that the expected 
length of a segment chosen by the algorithm is thus $(L + 1)/2$, the forward
algorithm is expected to visit only $2N/(L + 1)$ markers. The cost is then


$$\tilde{C}_{L<N} \approx \frac{2NL}{L + 1} \tag{3.33}$$

$$\Longrightarrow \tilde{C}_{L<N} \propto N, \tag{3.34}$$


which has the same dependence on the signal length as the upper bound in 
Equation (3.32); including the dependence on L, however, indicates that the 
average cost is roughly a factor of L/2 less than the worst case upper bound. 
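As a numerical illustration of the worst-case and average costs for the truncated case, the following Python sketch evaluates both for $N = 1000$ and $L = 4$; these values and the function names are chosen here purely for the example.

```python
def forward_worst_case_evaluations(n_cells, max_len):
    """Worst case: every marker 0 .. N-1 is visited, and at marker a the
    candidate segments end at markers a+1 .. min(a+L, N)."""
    return sum(min(max_len, n_cells - a) for a in range(n_cells))


def forward_average_evaluations(n_cells, max_len):
    """Estimate of Equation (3.33): about 2N/(L+1) markers are visited,
    with L candidate segments evaluated at each."""
    return 2 * n_cells * max_len / (max_len + 1)


N, L = 1000, 4
worst = forward_worst_case_evaluations(N, L)
average = forward_average_evaluations(N, L)
print(worst)             # 3994, equal to N L - (L^2 - L) / 2
print(average)           # 1600.0
print(worst / average)   # about 2.5, i.e. roughly (L + 1) / 2
```

The ratio of the two figures is close to $(L + 1)/2$, consistent with the factor of roughly $L/2$ noted above.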


3.4.3  Overlap-Add Synthesis with Time-Varying Windows 


The preceding discussion of segmentation in the sinusoidal model has focused 
on time-domain synthesis. For the sake of completeness, it is noted here that 
adaptive segmentation can also be applied in the frequency-domain synthesis 
method of Section 2.5. The fundamentals of such an approach are discussed 
below and connections to current techniques in audio coding are described. 

In the frequency-domain synthesizer, the signal is modeled as a series of 
short-time spectra, from which the signal is reconstructed using an inverse DFT 
and overlap-add. Each of these short-time spectra is a sum of spectral motifs 
corresponding to short-time partials. The motif is basically the transform of 
some window function $b[n]$, so the IDFT results in a sum of sinusoids windowed
by $b[n]$. The overlap-add is then carried out with the hybrid window
$t[n]/b[n]$, where $t[n]$ is a triangular window which satisfies the
overlap-add property. As
described in Sections 2.5.1 and 2.5.2, this triangular OLA carries out reasonable 
interpolation of the sinusoidal parameters if phase matching is employed. 

In a multiresolution implementation, it is necessary to incorporate motifs of
various time resolution; for longer segment sizes, the short-time spectrum has
more bins and the IDFT is larger. Recalling the developments in Section 2.5,
it is computationally important to use a symmetric spectral motif and likewise
a symmetric motif window $b[n]$. Adhering to this symmetry in a
multiresolution setting results in asymmetric overlap-add windows; indeed, the
interesting adjustment of the algorithm involves the overlap-add window and the
effective interpolating window $t[n]$. Because of the variable segment sizes, to do
the appropriate OLA interpolation it is necessary to use asymmetric triangular
windows at transitions between different segment sizes. This approach is best
described pictorially; Figure 3.16 shows a segmentation and the corresponding
motif and interpolation windows. The asymmetric transition windows are
conceptually similar to the start and stop windows used in audio coding
methods that employ window switching [26, 166]; in those methods, however, the
asymmetric windows are used in conjunction with a signal-adaptive filter bank
model and not with a parametric model as in this approach.


Figure 3.16. Multiresolution frequency-domain synthesis with adaptive segmentation
involves symmetric motif windows and asymmetric interpolation and overlap-add windows.
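The idea of asymmetric triangular interpolation windows can be illustrated with a short Python sketch. The construction is an assumption made here for illustration and is not taken from the text or from Figure 3.16: each window peaks at an anchor point (for instance a frame center), rises linearly from the previous anchor, and falls linearly to the next, so that unequal anchor spacing yields asymmetric windows that still sum to one.

```python
def asymmetric_triangular_windows(anchors):
    """Build one triangular window per anchor (increasing sample indices).

    Each window equals 1 at its own anchor and decays linearly to 0 at the
    neighboring anchors. Returns a list of lists, each of length
    anchors[-1] - anchors[0] + 1.
    """
    start, end = anchors[0], anchors[-1]
    n = end - start + 1
    windows = []
    for k, center in enumerate(anchors):
        w = [0.0] * n
        if k > 0:
            left = anchors[k - 1]
            for i in range(left, center):          # rising edge
                w[i - start] = (i - left) / (center - left)
        w[center - start] = 1.0
        if k < len(anchors) - 1:
            right = anchors[k + 1]
            for i in range(center + 1, right + 1):  # falling edge
                w[i - start] = (right - i) / (right - center)
        windows.append(w)
    return windows


# Anchors with unequal spacing: the window at sample 4 rises over 4 samples
# and falls over 8, so it is asymmetric.
wins = asymmetric_triangular_windows([0, 4, 12, 16])
total = [sum(w[i] for w in wins) for i in range(17)]
print(all(abs(t - 1.0) < 1e-12 for t in total))   # True
```

Because adjacent windows are complementary on the interval between their anchors, the set of windows satisfies the overlap-add property regardless of the segment sizes.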


3.5 CONCLUSION 


In modeling nonstationary signals, it is generally useful to carry out analysis- 
synthesis in a multiresolution framework; appropriate time-frequency resolution 
tradeoffs can be adaptively incorporated to achieve accurate compact models. 
In this chapter, the notion of multiresolution was introduced in terms of the 
discrete wavelet transform and further explored in the context of the sinusoidal 
model. Two methods of multiresolution sinusoidal modeling were discussed,
namely filter bank techniques and adaptive time segmentation. A dynamic 
program for signal segmentation was developed; related computation issues 
were considered. Simulations in the chapter showed that multiresolution mod- 
eling improves the localization of transients in the sinusoidal reconstruction; 
this improvement was indicated by a reduction of pre-echo distortion. 


4 RESIDUAL MODELING 


...a leaf of grass is no less than
the journey-work of the stars... 


— Walt Whitman, “Song of Myself” 


The sinusoidal model, while providing a useful parametric representation
for signal coding and modification, does not provide either perfect or percep- 
tually lossless reconstruction for most natural signals. Thus, it is necessary to 
separately model the analysis-synthesis residual if high-quality synthesis is de- 
sired; this requirement was the motivation for the deterministic-plus-stochastic 
decomposition proposed in [207, 208]. This chapter discusses a parametric ap- 
proach for perceptually modeling the noiselike residual for both time-domain 
and frequency-domain synthesis. 


4.1 MIXED MODELS 


Mixed models have been applied in many signal processing algorithms. For 
instance, in linear predictive coding (LPC) of speech, the speech signal is typ- 
ically classified as voiced or unvoiced to determine the synthesis model. In the 
voiced case, the synthesis filter is driven by a periodic impulse train; in the 
unvoiced case, the filter is driven by white noise. By choosing an appropriate 
excitation, the model can adapt to a nonstationary signal. In some variations of 
the algorithm, a mixed excitation is used to account for concurrent voiced and 
unvoiced signal behavior; using a mixture enables modeling of a wider range of 
signals than with a switched excitation [97, 120]. The voiced-unvoiced model, 
especially in the case of a mixed excitation, is similar to the deterministic-plus- 
stochastic sinusoidal model decomposition proposed in [207, 208] and explored 
further in [82, 83, 98, 144, 235, 236]. The components in these latter models are 
concurrent in time, which enables representation of a wide variety of signals. 

Figure 4.1. Analysis-synthesis and residual modeling.


In Section 2.1.2, where the deterministic-plus-stochastic decomposition was
first described, it was noted that in the framework of analysis-synthesis it is
natural to rephrase the decomposition in terms of a signal reconstruction and
a residual. The reconstruction is based on the signal model, in this case the
sinusoidal model; the residual is the difference between the original and the 
reconstruction. When the analysis-synthesis model does not capture all of the 
perceptually important features of a signal, it is necessary to separately model 
the residual and incorporate it into the reconstruction to achieve transparency; 
this scenario, which applies in the case of sinusoidal modeling, is depicted in 
Figure 4.1. Such modeling of residuals is used in many audio applications as 
well as in other signal processing algorithms, for instance motion-compensated 
video coding [162]. These approaches are effective because the residuals tend 
to be “noiselike” — in some cases such as LPC, the signal model is indeed de- 
signed with the very intent of leaving a white noise residual. In modeling such 
noiselike residuals, it is important to account for perceptual phenomena. As 
discussed in Section 1.2.2, white noise processes are basically incompressible if 
perfect reconstruction is desired. On the other hand, compact models of noise- 
like residuals can readily achieve perceptual losslessness by incorporating simple 
principles of perception. Furthermore, it should be noted that the condition 
of transparency can be relaxed somewhat for the residual synthesis given the 
perceptual masking principles that come into effect when the modeled residual 
is recombined with the primary signal. The fundamental goal is for the recom- 
bination to be perceptually equivalent to the original signal, and not for the 
synthesized residual to be a transparent version of the original residual. 

In music applications, the sinusoidal model captures the basic musical signal 
features such as the pitch and the spectral structure. The residual contains 
features that are not well-represented by the slowly-evolving sinusoids of the 
sum-of-partials model; these correspond to musically important processes such 
as the breath noise of a flute or saxophone or the attack of a piano or marimba. 
Multiresolution sinusoidal approaches were proposed in Chapter 3 to model the 
attacks, so the residual model of this chapter is designed to handle the remain- 
ing features, namely broadband stochastic processes such as breath noise. It 
is necessary to incorporate these processes into the reconstruction to achieve 
realistic or natural-sounding synthesis. 


In [207, 208], the residual is modeled using a piecewise-linear spectral estimate; 
a random phase is applied to this spectrum, and an inverse discrete 
Fourier transform followed by overlap-add is used for synthesis. In the approach 
to be discussed in this chapter, the model is similarly spectral in nature, but 
is more directly based on perceptual considerations. The residual is analyzed 
by a filter bank whose structure is motivated by auditory perception of broad- 
band noise; a parameterization provided by the short-time energy of the filter 
bank subbands yields a perceptually accurate reconstruction of the residual. 
Furthermore, the model parameters allow for modifications of the residual; this 
capability is useful in that if the sinusoidal signal components are modified, the 
residual should undergo a corresponding transformation prior to synthesis [68]. 


In [98, 235, 236] the models are more elaborate than the one presented in this 
chapter in that they have specific extensions to model attack artifacts present 
in the residual. The approach taken here is to use multiresolution sinusoidal 
modeling to minimize such artifacts so that they do not appear in the residual 
and thus do not have to be accounted for in the residual model. A similar 
approach is taken in the algorithm in [76], which estimates the time-domain 
envelope of the signal and applies it to the sinusoidal model to enhance the 
modeling of transients. This method involves incorporating another set of pa- 
rameters to describe the time-domain envelope, however, so the multiresolution 
model has an advantage in that its representation is more uniform. 


Figure 4.2 gives a comparison of the residuals for a basic sinusoidal model and 
a multiresolution model based on dynamic segmentation as developed in Section 
3.4. Clearly, the attack artifacts are not as pronounced in the residuals of the 
multiresolution model. Because of its improved ability to represent the signal 
transients, the dynamic model results in a lower residual energy; as discussed in 
Section 3.4, the multiresolution model is adapted to minimize this energy given 
various constraints such as the number of sinusoids in the model. This notion of 
minimizing the residual energy is also incorporated in the analysis-by-synthesis 
algorithm discussed in [76] and in global parameter optimization methods [46]. 
Also, in the methods to be discussed in later chapters, minimization of the 
residual energy is again the criterion by which the signal model is adapted. 


4.2 MODEL OF NOISE PERCEPTION 


From this point on, it is assumed that attack transients have been well-modeled 
in a multiresolution framework. The residual thus consists of broadband noise 
processes. A perceptually viable model for the residual should therefore rely on 
a model of how the auditory system perceives broadband noise. This section 
discusses a simple filter bank model of the auditory system that leads to a 
perceptually lossless representation of the residual. 
Figure 4.2. Comparison of residuals for fixed and multiresolution sinusoidal models. The 
original signal (a) is a saxophone note. Plot (b) is a reconstruction based on a fixed frame 
size of 1024 and (c) is the residual for that case. Plot (d) is a reconstruction using dynamic 
segmentation with frame sizes 512 and 1024; in this case, the attack is well-modeled and 
does not appear as extensively in the residual (e). 


4.2.1 Auditory Models 


Auditory models commonly include a set of overlapping bandpass filters whose 
bandwidths increase roughly in proportion to their center frequencies. Such fil- 
ter bank models, which were first introduced in conjunction with the classical 
theory of resonance [101], are well justified by experimental work ranging from 
early masking tests for telephony applications [64, 65] to recent investigations 
in perceptual audio coding, where auditory models are incorporated to achieve 
transparent compression [26, 91, 92, 166, 223]. These auditory filter banks can 
be characterized in terms of the classical critical bandwidths, which were derived 
in experiments on noise masking and perception of complex sounds; these 
are generally considered to be the bandwidths of the auditory filters at certain 
center frequencies [248]. Early estimates of the critical bandwidth as a function 
of center frequency indicate a roughly constant value below 500 Hz and a linear 
increase for higher frequencies, resulting in the common interpretation of the 
auditory system as a constant-Q filter bank. More recent experiments suggest 
that the low-frequency critical bandwidths are quadratically related to the cen- 
ter frequency [154]. Expressions for the equivalent rectangular bandwidths of the 
Figure 4.3. Bandwidth vs. center frequency for critical bands (dashed) and equivalent 
rectangular bands or ERBs (solid). 


auditory filters differ somewhat from the bandwidth formulations in classical 
critical band theory; the difference is depicted in Figure 4.3. Of course, these 
results are based on aggregate measurements over large groups of subjects, so 
the exact relation does not necessarily apply to any given individual. Further- 
more, for this application of residual modeling it is unnecessary to incorporate 
formal exactitudes about the auditory filter responses because the perception 
of broadband noise is an inherently coarse phenomenon. The purpose of the 
current discussion, then, is only to support the notion of filter bank auditory 
models and to establish the terminology. For the remainder of the chapter, an 
equivalent rectangular band will be referred to as an ERB. 
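
As a rough numerical companion to the comparison in Figure 4.3, the sketch below evaluates two widely quoted closed-form approximations: a critical-bandwidth expression of the Zwicker-Terhardt type and a quadratic ERB fit of the Moore-Glasberg type. The choice of these particular formulas is an assumption made here for illustration; the text does not state which expressions underlie the figure, and the function names are hypothetical.

```python
import numpy as np

def critical_bandwidth_hz(f_hz):
    """Critical bandwidth approximation: 25 + 75 (1 + 1.4 f^2)^0.69, f in kHz."""
    f = np.asarray(f_hz, dtype=float) / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f ** 2) ** 0.69

def erb_hz(f_hz):
    """Quadratic ERB approximation: 6.23 f^2 + 93.39 f + 28.52, f in kHz."""
    f = np.asarray(f_hz, dtype=float) / 1000.0
    return 6.23 * f ** 2 + 93.39 * f + 28.52

# Compare the two bandwidth scales at a few center frequencies (in Hz).
for fc in (100.0, 500.0, 1000.0, 4000.0):
    print(fc, float(critical_bandwidth_hz(fc)), float(erb_hz(fc)))
```

Under these assumed formulas the ERB is markedly narrower than the classical critical band at low center frequencies, which is the kind of difference the figure depicts.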


4.2.2 Filter Bank Formulation 


A simple model of noise perception can be arrived at by dividing the spec- 
trum into a set of bands based on the ERB formulation. Given this division 
into bands, the basic model is that in perceiving a broadband noise, the au- 
ditory system is primarily sensitive to the total short-time energy in each of 
the bands, and not to the specific distribution of energy within any single 
band. In other words, the ear is insensitive to specific local time or frequency 
behavior of broadband noise. Operating under this assumption, analysis of a 
broadband noise s[n], which corresponds to r[n] in the residual modeling frame- 
work of Figure 4.1, is carried out by first applying s[n] to an ERB filter bank 
$\{h_1[n], h_2[n], \ldots, h_R[n]\}$ to derive the ERB signals $\{s_1[n], s_2[n], \ldots, s_R[n]\}$ as 
shown in Figure 4.4. These signals are then parameterized on a frame-rate 
basis in terms of their energies; for the i-th frame, the energy of the r-th ERB 
signal is given by 

$$E_r(i) = \sum_{n=0}^{N-1} s_r[n + iL]^2, \tag{4.1}$$


where N is the frame size and L is the analysis stride. Synthesis according 
to this model is achieved by filtering white noise $\psi[n]$ through the ERB filter 
bank with a time-varying gain $c_r(i)$ on each channel; this structure is shown in 
Figure 4.5. 
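
The frame-rate parameterization of Equation (4.1) can be sketched as follows. This is a minimal illustration, assuming the subband signals have already been computed and stored as rows of an array; the function name and the array layout are hypothetical choices, not part of the original formulation.

```python
import numpy as np

def subband_energies(subbands, N, L):
    """Short-time energies E_r(i) in the sense of Equation (4.1).

    subbands: array of shape (R, num_samples), one row per ERB signal s_r[n].
    N: frame size; L: analysis stride.
    Returns an array of shape (R, num_frames); only complete frames are used.
    """
    subbands = np.asarray(subbands, dtype=float)
    R, num_samples = subbands.shape
    if num_samples < N:
        return np.zeros((R, 0))
    num_frames = 1 + (num_samples - N) // L
    energies = np.empty((R, num_frames))
    for i in range(num_frames):
        frame = subbands[:, i * L : i * L + N]       # samples n + iL, n = 0..N-1
        energies[:, i] = np.sum(frame ** 2, axis=1)  # sum of squares per band
    return energies
```

For instance, `subband_energies(np.ones((2, 8)), N=4, L=2)` yields three frames per band, each with energy 4.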
Figure 4.4. Analysis filter bank for perceptually modeling broadband noise. The residual 
is parameterized in terms of the short-time energies $E_r(i)$ in a set of equivalent rectangular 
bands (ERBs). 


The time-varying gains in the synthesis filter bank shape the short-time spectrum 
of the filter bank output $\hat{s}[n]$ so that it matches the short-time spectrum 
of $s[n]$ in the sense that their ERB energies are equivalent. The appropriate 
gain can be derived using a simple constraint on the expected value of the 
synthesis energy: 


$$E\{\hat{E}_r(i)\} = E_r(i). \tag{4.2}$$


Note that this filter bank model relies on the aggregation of filters only inasmuch 
as they span the signal spectrum; the interaction between filters is not 
important. The model is simply that the subband ERB signal $s_r[n]$ is perceptually 
equivalent to the subband reconstruction $\hat{s}_r[n]$ if their short-time 
energies meet the above constraint; then, if the filter bank is designed such 
that $s[n] = \sum_{r} s_r[n]$, perceptual losslessness holds for the entire filter bank 
model. 

The appropriate gains can be derived by expanding the constraint of Equation 
(4.2). The expected value of the synthesis energy of the r-th band in the 
i-th frame is given by 


$$E\{\hat{E}_r(i)\} = E\left\{ \sum_{n=0}^{N-1} \left( c_r(i)\, \tilde{s}_r[n + iL] \right)^2 \right\}, \tag{4.3}$$

where 

$$\tilde{s}_r[n] = h_r[n] * \psi[n] \tag{4.4}$$


is the output of the r-th synthesis filter before the gain $c_r(i)$ is applied. Substituting 
this convolution into Equation (4.3) yields the following expression; the 
Figure 4.5. Synthesis filter bank for perceptually modeling broadband noise. The time-varying 
gains $c_r(i)$ given by Equation (4.8) shape the short-time spectrum of $\hat{s}[n]$ to match 
that of $s[n]$ in Figure 4.4. 


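index iL is dropped without loss of generality: 


$$E\{\hat{E}_r(i)\} = c_r(i)^2\, E\left\{ \sum_{n=0}^{N-1} \left( \sum_{m} h_r[m]\, \psi[n - m] \right)^2 \right\} \tag{4.5}$$

$$= c_r(i)^2 \sum_{n=0}^{N-1} \sum_{m} \sum_{l} h_r[m]\, h_r[l]\, E\{\psi[n - m]\, \psi[n - l]\}. \tag{4.6}$$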


Denoting the variance of the white noise $\psi[n]$ as $\sigma^2$, the expected value in the 
sum can be replaced by $\sigma^2 \delta[m - l]$. Summing over $l$ then yields 


$$E\{\hat{E}_r(i)\} = c_r(i)^2 N \sigma^2 \sum_{m} h_r[m]^2. \tag{4.7}$$
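
The expected-energy result of Equation (4.7) can be checked numerically. The sketch below is only an illustration under stated assumptions: the noise is taken to be Gaussian, the filter is an arbitrary real FIR filter chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def expected_synthesis_energy(h, c, N, sigma2):
    """Closed form in the sense of Equation (4.7): c^2 N sigma^2 sum_m h[m]^2."""
    return c ** 2 * N * sigma2 * np.sum(np.asarray(h, dtype=float) ** 2)

def measured_synthesis_energy(h, c, N, sigma2, trials, rng):
    """Average N-sample frame energy of c * (h * noise) over many realizations."""
    h = np.asarray(h, dtype=float)
    total = 0.0
    for _ in range(trials):
        # Extra input samples so that all N output samples are in steady state.
        noise = rng.normal(0.0, np.sqrt(sigma2), N + len(h) - 1)
        frame = c * np.convolve(noise, h, mode="valid")
        total += np.sum(frame ** 2)
    return total / trials

rng = np.random.default_rng(0)
h = np.hanning(31) * np.cos(0.3 * np.pi * np.arange(31))  # arbitrary real bandpass FIR
predicted = expected_synthesis_energy(h, c=0.5, N=256, sigma2=2.0)
measured = measured_synthesis_energy(h, c=0.5, N=256, sigma2=2.0, trials=2000, rng=rng)
print(predicted, measured)
```

The two printed values should agree closely, with the remaining discrepancy shrinking as the number of trials grows.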


Note that the filters have been assumed real so that the subband signals are 
real and thus immediately perceptually meaningful. Combining Equations (4.2) 
and (4.7) yields a formula for the gain in terms of the ERB energy parameter: 

$$c_r(i) = \left[ \frac{E_r(i)}{N \sigma^2 \sum_{m} h_r[m]^2} \right]^{1/2}. \tag{4.8}$$

The above equation can be interpreted in two ways. First, the channel gain 
$c_r(i)$ can be thought of as a frequency domain ratio between the ERB energy 
in band r measured by the analysis and the energy at the output of the r-th 
synthesis filter: 


$$E_r(i) = c_r(i)^2\, E\left\{ \frac{1}{K} \sum_{k=0}^{K-1} \left| H_r[k]\, \Psi[k] \right|^2 \right\}. \tag{4.9}$$
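Rewriting this in terms of the time-domain noise input yields 


$$E_r(i) = \frac{c_r(i)^2}{K} \sum_{k=0}^{K-1} |H_r[k]|^2\, E\left\{ \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} \psi[n]\, \psi[m]\, e^{-j 2\pi k (n - m)/K} \right\} \tag{4.10}$$

$$= \frac{c_r(i)^2 \sigma^2}{K} \sum_{k=0}^{K-1} |H_r[k]|^2 \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} \delta[n - m]\, e^{-j 2\pi k (n - m)/K} \tag{4.11}$$

$$= \frac{c_r(i)^2 \sigma^2 N}{K} \sum_{k=0}^{K-1} |H_r[k]|^2 \tag{4.12}$$

$$= c_r(i)^2 \sigma^2 N \sum_{m} h_r[m]^2, \tag{4.13}$$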


which can be manipulated to give Equation (4.8). The second interpretation 
is based on equalizing the short-time variances of the subband signal $s_r[n]$ and 
its estimate $\hat{s}_r[n]$. A slightly biased estimate of the variance of $s_r[n]$ in the i-th 
frame is given by [167]: 


$$\mathrm{var}(s_{r,i}[n]) = \frac{1}{N} \sum_{n=0}^{N-1} s_r[n + iL]^2 = \frac{E_r(i)}{N}. \tag{4.14}$$

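The variance of $\hat{s}_r[n]$ in the i-th frame can be derived by considering the effect 
of a linear filter on the autocorrelation of a stochastic process: 


$$E\{\hat{s}_r[n]\, \hat{s}_r[n + \tau]\} = E\left\{ \sum_{m} c_r(i)\, h_r[m]\, \psi[n - m] \sum_{l} c_r(i)\, h_r[l]\, \psi[n + \tau - l] \right\} \tag{4.15}$$

$$= c_r(i)^2 \sum_{m} \sum_{l} h_r[m]\, h_r[l]\, E\{\psi[n - m]\, \psi[n + \tau - l]\} \tag{4.16}$$

$$= \sigma^2 c_r(i)^2 \sum_{m} \sum_{l} h_r[m]\, h_r[l]\, \delta[l - m - \tau] \tag{4.17}$$

$$= \sigma^2 c_r(i)^2 \sum_{m} h_r[m]\, h_r[m + \tau]. \tag{4.18}$$


Evaluating at $\tau = 0$ yields the variance as 

$$\mathrm{var}(\hat{s}_{r,i}[n]) = \sigma^2 c_r(i)^2 \sum_{m} h_r[m]^2. \tag{4.19}$$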


Combining the expressions in (4.14) and (4.19) again yields the gain formula 
of Equation (4.8). This second perspective shows that this formulation does 
not involve strict process matching in the autocorrelation sense; rather, a loose 
matching is achieved in the sense that the local autocorrelations of the processes 
$s_r[n]$ and $\hat{s}_r[n]$ are equalized in the first order. In this light, the filter 
bank analysis-synthesis can be interpreted as a first-order subband linear predictive 
coding system. Higher order LPC methods, while designed to model 
locally stationary random processes, are not particularly useful for this modeling 
scenario since the parameterization is not tightly coupled to perceptual 
factors [207]. 

The formulation above can be rephrased in terms of the spectral densities of 
the original and reconstructed processes. This provides a more intuitive explanation 
of the filter bank residual model than the variance matching framework, 
and relates the two interpretations given in the preceding paragraph. Using 
the Parseval relation 


$$\sum_{m} h_r[m]^2 = \frac{1}{2\pi} \int_0^{2\pi} |H_r(e^{j\omega})|^2\, d\omega, \tag{4.20}$$

the subband gain from Equation (4.8) can be rewritten as 

$$c_r(i)^2 = \frac{2\pi}{\sigma^2} \left[ \frac{E_r(i)}{N} \right] \frac{1}{\int_0^{2\pi} |H_r(e^{j\omega})|^2\, d\omega}. \tag{4.21}$$


The term $E_r(i)/N$ is a variance estimate as established in Equation (4.14); 
then, since the variance of a random process is the average value of its power 
spectral density (PSD), the gain can be rewritten as 

$$c_r(i)^2 = \frac{1}{\sigma^2}\, \frac{\int_0^{2\pi} S_{r,i}(e^{j\omega})\, d\omega}{\int_0^{2\pi} |H_r(e^{j\omega})|^2\, d\omega}, \tag{4.22}$$

where $S_{r,i}(e^{j\omega})$ is the PSD of the r-th subband signal in the analysis filter 
bank in the i-th frame. The numerator in the above expression can be written 
in terms of the PSD of the original signal $s[n]$: 

$$c_r(i)^2 = \frac{1}{\sigma^2}\, \frac{\int_0^{2\pi} S_i(e^{j\omega})\, |H_r(e^{j\omega})|^2\, d\omega}{\int_0^{2\pi} |H_r(e^{j\omega})|^2\, d\omega}. \tag{4.23}$$

This expression indicates that the gain for the r-th band is based on the average 
value of the input PSD over the r-th band; the $\sigma^2$ term normalizes the variance 
of the white noise source $\psi[n]$ in the synthesis filter bank. 

Using Equation (4.23), the PSDs of the original and reconstructed subband 
signals can be related. The PSD of the synthesized process $\hat{s}_r[n]$ in the i-th 
frame is given by 

$$\hat{S}_{r,i}(e^{j\omega}) = \sigma^2 c_r(i)^2\, |H_r(e^{j\omega})|^2, \tag{4.24}$$

so substituting for $c_r(i)$ yields the relationship 

$$\hat{S}_{r,i}(e^{j\omega}) = \frac{|H_r(e^{j\omega})|^2 \int_0^{2\pi} S_{r,i}(e^{j\omega})\, d\omega}{\int_0^{2\pi} |H_r(e^{j\omega})|^2\, d\omega} \tag{4.25}$$

$$= \frac{|H_r(e^{j\omega})|^2 \int_0^{2\pi} S_i(e^{j\omega})\, |H_r(e^{j\omega})|^2\, d\omega}{\int_0^{2\pi} |H_r(e^{j\omega})|^2\, d\omega}. \tag{4.26}$$

This derivation shows that the ERB parameterization leads to a reconstruction 
whose subband power spectra correspond to averages of the input power 
spectra over the various bands of the filter bank. The formal relationship be- 
tween the PSD of the fully reconstructed signal and the original signal is more 
complicated, however, since cross terms are introduced in the output PSD be- 
cause the subband signals are not independent. The constraints required to 
achieve such independence substantially restrict the filter bank design and are 
thus not incorporated; also, since the perceptual model is based on subbands, 
considerations regarding the PSD of the full output are not called for. 

The result of Equation (4.8) clearly holds for the case L = N, where the gain 
is simply updated for each new synthesis frame. Abrupt gain changes at frame 
boundaries may cause discontinuities in the output; an alternative approach is 
to use L = N/2 and carry out an overlap-add process to construct the output. 
Then, the above gain calculation can also be applied, provided that the window 
overlap-adds to one for a stride of N/2 and that the energy in a given band 
does not change drastically from frame to frame. 
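
One way to realize the overlapped case is sketched below, assuming a periodic Hann window (which overlap-adds to one at a stride of N/2) and applying the per-frame gains as a smooth envelope on a single filtered noise stream. This is one possible reading of the overlap-add scheme rather than the only one; the function names are hypothetical and the noise is assumed Gaussian.

```python
import numpy as np

def channel_gain(E, N, sigma2, h):
    """Per-frame gain for one band: sqrt(E / (N sigma^2 sum_m h[m]^2))."""
    return np.sqrt(E / (N * sigma2 * np.sum(np.asarray(h, dtype=float) ** 2)))

def gain_envelope(gains, N):
    """Overlap-add the frame gains with a window that sums to one at stride N/2."""
    L = N // 2
    n = np.arange(N)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / N)  # periodic Hann
    env = np.zeros(L * (len(gains) - 1) + N)
    for i, c in enumerate(gains):
        env[i * L : i * L + N] += c * window
    return env

def synthesize_band(energies, h, N, sigma2, rng):
    """Filtered white noise for one band, shaped by the overlap-added gains."""
    h = np.asarray(h, dtype=float)
    gains = [channel_gain(E, N, sigma2, h) for E in energies]
    env = gain_envelope(gains, N)
    noise = rng.normal(0.0, np.sqrt(sigma2), len(env) + len(h) - 1)
    return env * np.convolve(noise, h, mode="valid")
```

When the band energy is constant from frame to frame, the envelope equals the single-frame gain everywhere except in the first and last half-frames, so the non-overlapped result is recovered in the interior of the signal.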

This filter bank approach is useful for modeling the noiselike residual of the 
sinusoidal model in that it provides a small set of parameters that describe the 
general time-frequency behavior of the stochastic component. For example, the 
model is effective at the sample rate $f_s$ = 44.1 kHz for R = 12 bands with a 
frame size of N = 256 and a stride of L = 128; in this case, the residual signal 
is essentially downsampled by a factor of ten into a transparent parametric 
representation. The original and the synthesized signals have the same general 
time-frequency behavior, and because the ear is mostly insensitive to the fine 
details of a noiselike signal, this analysis-synthesis of the stochastic component 
is basically perceptually lossless. Greater compaction can be readily achieved 
by using larger frames and longer strides; the case above was cited in particular 
since it fits directly into the specific structure of the frequency-domain synthe- 
sizer discussed in Section 2.5. Also, to control the amount of model data, the 
number of bands can be increased or decreased simply by scaling the bandwidth 
of each ERB by a common factor. Finally, note that the length of the analysis 
frames and strides can be time-varied to estimate the residual parameters for a 
dynamically segmented sinusoidal model, i.e. a multiresolution model; alternatively, 
model parameters can be estimated at arbitrary times by interpolating 
between estimates evaluated at regularly spaced times, but such interpolation 
assumes a certain smoothness in the parameter evolution. 
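
The data reduction quoted above for R = 12 and L = 128 follows from simple counting, as the short sketch below shows; the function names are hypothetical, and the count covers only the ERB energies, with no allowance for quantization or side information.

```python
def compaction_factor(num_bands, stride):
    """Residual samples represented per model parameter."""
    return stride / num_bands

def parameters_per_second(num_bands, stride, fs):
    """Number of ERB energy parameters produced per second of signal."""
    return num_bands * fs / stride

print(compaction_factor(num_bands=12, stride=128))              # about 10.7
print(parameters_per_second(num_bands=12, stride=128, fs=44100.0))
```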


4.2.3 Requirements for Residual Coding 


The filter bank model of the sinusoidal analysis-synthesis residual meets three 
basic requirements for residual coding that have been established in the preced- 
ing discussions, namely compaction, perceptual relevance, and transparency. 
Compaction is especially desirable since the residual is secondary in impor- 
tance to the primary reconstruction; perceptual relevance is of interest since it 
allows meaningful modifications to be carried out. The last condition, that of 
perceptual losslessness, is of course important for any audio signal model; in 
residual modeling, there is some leeway due to masking effects that occur upon 
combination with the primary reconstruction. 

In addition to the criteria discussed above, another useful feature of a resid- 
ual model is the ability to economically recombine the residual parameters with 
the parameters of the primary signal model prior to reconstruction. In the 
frequency-domain sinusoidal synthesizer, some computation is saved by using 
the ERB model to derive a spectral representation of the residual that can be 
combined with the sinusoidal spectrum before the IDFT. This DFT-based ap- 
proach is discussed further in the next section; a time-domain implementation 
of the filter bank model is also presented. 


4.3 RESIDUAL ANALYSIS-SYNTHESIS 


The filter bank model of broadband noise perception can be implemented in 
the time domain as formulated in the previous section. For frequency-domain 
sinusoidal synthesis, the model can be rephrased in terms of the DFT to allow 
a merged synthesis of the partials and the residual component. Details of both 
approaches are given below. 


4.3.1 Filter Bank Implementation 


The filter bank for the residual model is subject to looser design constraints 
than critically sampled filter banks [82]. In this section, these constraints are 
discussed and a simple design approach is given. 


Perfect reconstruction constraints. Perfect reconstruction filter banks 
were discussed at length in Section 2.2.1; recall that in the subsampled case, 
aliasing introduced by the analysis is cancelled in the synthesis filtering pro- 
cess. Then, the requirement of a distortionless input-output transfer function 
along with this aliasing cancellation provides a set of design constraints for the 
filter bank. Due to the various advantages of subband processing, such filter 
bank approaches have been widely dealt with in the literature, but primarily 
for the case of uniform or octave-band filter banks [232, 238]. Some results 
on nonuniform critically sampled and oversampled perfect reconstruction filter 
banks have also been presented [40, 117, 133, 161, 176]. 

The design of a nonuniform filter bank for the noise perception model pro- 
posed in Section 4.2 differs from the perfect reconstruction problem discussed 
above. In the perception model, the ERB analysis filter bank provides a set of 
subband signals from which short-time gains are derived; for synthesis, these 
gains are applied to the subbands of an ERB filter bank driven by white noise. 
This framework is quite different from a typical critically sampled analysis- 
synthesis filter bank, so the filter bank design is subject to different constraints 
than those of a critically sampled system. A sensible perfect reconstruction 
constraint for the ERB filter bank is that the sum of the subband signals 
should equal the original signal; then, no distortion is introduced in deriving 
the subband ERB signals. For an R-band filter bank, this constraint corresponds 
simply to: 


$$\sum_{r=1}^{R} s_r[n] = s[n] \iff \sum_{r=1}^{R} h_r[n] = \delta[n]. \tag{4.27}$$


Scaling and delay are of course allowed since such effects can be readily com- 
pensated for in this application. Given the subband perfect reconstruction 
constraint, the only other issue is that arbitrary passband edges should be al- 
lowed for the filters at the design stage; such design flexibility enables a wider 
range of experiments, for instance with variable band allocation, than in a rigid 
approach. The filter bank design is discussed below. 


Filter bank design. Given a set of arbitrary frequency band edges spanning 
from 0 to the Nyquist frequency $f_s/2$, where the set will be denoted by 


$$f_{\text{edges}} = \{ f_0 \;\; f_1 \;\; \cdots \;\; f_r \;\; \cdots \;\; f_{R-1} \;\; f_R \} \tag{4.28}$$

with $f_0 = 0$ and $f_R = f_s/2$, which corresponds in radian frequency to 

$$\omega_{\text{edges}} = \{ \Theta_0 \;\; \Theta_1 \;\; \cdots \;\; \Theta_r \;\; \cdots \;\; \Theta_{R-1} \;\; \Theta_R \} = \frac{2\pi}{f_s}\, f_{\text{edges}} \tag{4.29}$$

with $\Theta_0 = 0$ and $\Theta_R = \pi$, consider ideal bandpass filters of the form 

$$b_r[n] = \frac{\Delta_r}{\pi} \cos(\omega_r n) \left( \frac{\sin(\Delta_r n/2)}{\Delta_r n/2} \right), \tag{4.30}$$

where 

$$\Delta_r = \Theta_r - \Theta_{r-1} \tag{4.31}$$

is the bandwidth of the r-th filter and 

$$\omega_r = \frac{\Theta_r + \Theta_{r-1}}{2} \tag{4.32}$$

is the center frequency of the positive frequency passband of the r-th filter; because 
the filters are real, each has a negative frequency passband as well. Since 
the R bands are nonoverlapping and span the entire spectrum by definition, 
the frequency responses $B_r(e^{j\omega})$ of the corresponding R ideal bandpass filters 
simply add up to one: 

$$\sum_{r=1}^{R} B_r(e^{j\omega}) = 1 \iff \sum_{r=1}^{R} b_r[n] = \delta[n], \tag{4.33}$$

which shows that this ideal filter bank satisfies the subband perfect reconstruction 
constraint of Equation (4.27). 
Figure 4.6. The window method for filter design. Time-domain multiplication corresponds 
to frequency-domain convolution, so windowing the sinc function as in (a) corresponds to 
the convolution shown in (b); the resulting FIR filter shown in (c) has the nonideal frequency 
response shown in (d), where the ideal filter (dashed) is included for comparison. 


The ideal filter bank $\{B_r(e^{j\omega})\}$ consists of two-sided IIR filters that are not 
realizable. However, a realizable FIR filter bank that satisfies the subband perfect 
reconstruction constraint can be derived from the ideal filter bank by using 
the window method of FIR filter design; this method suggests that a realizable 
FIR approximation of an ideal filter can be obtained by time-windowing the 
ideal filter’s impulse response [165]. The frequency response of the approximate 
filter is given by a convolution of the ideal filter response and the transform of 
the window, which results in a smearing of the ideal response: 


$$h_{\text{approx}}[n] = f[n]\, h_{\text{ideal}}[n] \iff H_{\text{approx}}(e^{j\omega}) = F(e^{j\omega}) * H_{\text{ideal}}(e^{j\omega}). \tag{4.34}$$


This window-based approximation process is depicted in Figure 4.6; the ap- 
proximate filter has transition regions in the frequency domain where the ideal 
filter has sharp cutoffs; also, ripples appear in the frequency response of the 
approximate filter. 

In designing single filters, the window method leads to approximate realizations. 
In the filter bank case, however, it is possible to satisfy the subband 
perfect reconstruction condition exactly with realizable filters based on the 
window method. Introducing a window $f[n]$ on both sides of the right-hand 
expression in Equation (4.33) yields 


$$f[n] \sum_{r=1}^{R} b_r[n] = \delta[n]\, f[n] \tag{4.35}$$

$$\Longrightarrow \sum_{r=1}^{R} f[n]\, b_r[n] = \sum_{r=1}^{R} h_r[n] = \delta[n]\, f[0], \tag{4.36}$$


where $h_r[n] = f[n]\, b_r[n]$, which shows that the window-based filter bank satisfies 
the perfect reconstruction constraint, provided that $f[0]$ is nonzero. This result 
can also be derived in the frequency domain, where application of the window 
$f[n]$ corresponds to a convolution with the window transform: 


$$F(e^{j\omega}) * \sum_{r=1}^{R} B_r(e^{j\omega}) = F(e^{j\omega}) * 1 \tag{4.37}$$

$$\Longrightarrow \sum_{r=1}^{R} H_r(e^{j\omega}) = \frac{1}{2\pi} \int_0^{2\pi} F(e^{j\omega})\, d\omega = f[0], \tag{4.38}$$

where $H_r(e^{j\omega}) = F(e^{j\omega}) * B_r(e^{j\omega})$. The convolution of $F(e^{j\omega})$ with the 
unity response of the ideal filter bank sum is simply equivalent to a full-band 
integration, the result of which is the constant $f[0]$. This shows again that 
the nonideal filters $H_r(e^{j\omega})$ satisfy the constraint of Equation (4.27); in the 
frequency-domain sum, the transition regions and ripples of a given filter are 
counteracted by contributions from the other filters. While this filter design 
method is useful for the application at hand, it should be noted that it does not 
readily apply to the design of subsampled perfect reconstruction filter banks. 
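
The window-based design can be sketched in a few lines. The example below builds the ideal bandpass responses from a set of band edges, applies a common window, and confirms numerically that the filters sum to a scaled unit impulse. The function names, the Hamming window, and the band edges are illustrative assumptions; the edges are not ERB values.

```python
import numpy as np

def ideal_bandpass(lo, hi, n):
    """Ideal real bandpass impulse response for band edges lo < hi (radians)."""
    delta = hi - lo            # bandwidth of the band
    center = 0.5 * (hi + lo)   # center frequency of the positive passband
    # np.sinc(x) = sin(pi x) / (pi x), so this factor is sin(delta n/2) / (delta n/2).
    return (delta / np.pi) * np.cos(center * n) * np.sinc(delta * n / (2.0 * np.pi))

def window_filter_bank(edges_hz, fs, window):
    """FIR filters h_r[n] = f[n] b_r[n] for a window f of odd length 2M + 1."""
    window = np.asarray(window, dtype=float)
    M = (len(window) - 1) // 2
    n = np.arange(-M, M + 1)
    edges = 2.0 * np.pi * np.asarray(edges_hz, dtype=float) / fs
    return np.array([window * ideal_bandpass(lo, hi, n)
                     for lo, hi in zip(edges[:-1], edges[1:])])

window = np.hamming(41)
bank = window_filter_bank([0, 100, 300, 700, 1500, 4000], fs=8000, window=window)
total = bank.sum(axis=0)
# The sum is zero everywhere except at n = 0 (index M), where it equals f[0].
print(np.max(np.abs(np.delete(total, 20))), total[20], window[20])
```

The band edges must start at 0 and end at the Nyquist frequency for the telescoping sum to vanish away from n = 0.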

The derivation in Equations (4.35) through (4.38) shows that the only restriction 
on the window $f[n]$ is that it be nonzero at n = 0; it can thus be 
used to vary the response of the filters without affecting the perfect reconstruction 
property. One useful choice for $f[n]$ is the raised cosine pulse, common 
in digital communication applications, which enables the filter responses to be 
controlled by way of the excess bandwidth parameter $\alpha$ [126]. The raised cosine 
is defined as 


$$f[n] = \begin{cases} \dfrac{\cos(\Delta \alpha n / 2)}{1 - (\Delta \alpha n / \pi)^2} & -M \le n \le M \\ 0 & \text{otherwise,} \end{cases} \tag{4.39}$$


where $\Delta$ is the filter bandwidth; the length of filters designed with this window 
is 2M + 1. Since the same window must be applied to all the filters, the product 
$\Delta\alpha$ is a constant for the filter bank; in other words, $\Delta_i \alpha_i = \Delta_j \alpha_j$ holds for 
any of the R filters in the filter bank. Then, since the filter bandwidths differ, 
there is essentially a different excess bandwidth parameter for each filter. In 
choosing the excess bandwidth, there is thus only one degree of freedom, which 
implies that the overlap between adjacent filters will behave similarly across 
the entire spectrum. Filter bank responses based on this design are shown in 
Figure 4.7 for varying M and $\alpha_1$, the excess bandwidth of the first filter. 
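
A raised cosine window of this kind can be generated as in the sketch below, which assumes the form cos(Δαn/2) / (1 - (Δαn/π)^2) on -M ≤ n ≤ M and takes the product Δα as a single argument. The function name is hypothetical. The expression is indeterminate where Δαn/π = ±1, so the limiting value π/4 is substituted at those samples.

```python
import numpy as np

def raised_cosine_window(M, delta_alpha):
    """Raised cosine window on -M <= n <= M; delta_alpha is the product Delta * alpha."""
    n = np.arange(-M, M + 1)
    x = delta_alpha * n / np.pi
    denom = 1.0 - x ** 2
    singular = np.isclose(denom, 0.0)
    safe_denom = np.where(singular, 1.0, denom)  # avoid division by zero
    f = np.cos(delta_alpha * n / 2.0) / safe_denom
    # Where x = +/-1 the numerator and denominator both vanish; the limit is pi/4.
    return np.where(singular, np.pi / 4.0, f)
```

Since the window equals one at n = 0 for any choice of the parameters, a filter bank built with it sums exactly to a unit impulse.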
Figure 4.7. Frequency responses for a 6-band filter bank with (a) M = 20 and $\alpha_1$ = 0.5, 
(b) M = 40 and $\alpha_1$ = 0.5, and (c) M = 40 and $\alpha_1$ = 0.95. 


Figure 4.7 indicates the flexibility of this design procedure: the band edges 
are arbitrary, the filter length is arbitrary but the same for each band, and 
the filter ripple and transition behavior are readily controllable. Beyond the 
standard time-frequency resolution tradeoffs in filter design, the flexibility of 
the filter response is limited only in that the formulation requires that the same 
window function f[n] be used for each filter in the filter bank. The choice of 
window essentially limits the frequency resolution of the narrowest band in the 
filter bank. For wide bands, the sinc impulse response of the underlying ideal 
filter is narrow, so a long, smooth window will not affect the response of the 
windowed filter drastically; for narrow bands, on the other hand, the time 
domain sinc response is spread out. To maintain the frequency resolution of 
the narrowest band, then, it is necessary that the window be long enough to 
cover the majority of the energy of the narrowband sinc function. 

This design approach has proven useful for the ERB-based stochastic signal 
model; the ease and flexibility of the design allow for a wide variety of ex- 
periments involving reallocating the frequency bands and trading off the time- 
frequency resolution of the ERB parameterization. 


4.3.2 DFT-Based Implementation 


For the frequency-domain synthesizer discussed in Section 2.5, it is compu- 
tationally advantageous to derive a representation of the residual that can be 
combined with the spectrum of the partials before the inverse Fourier transform 
is carried out. It is thus useful to devise a DFT-based algorithm for modeling 
the residual; analysis, synthesis, and normalization issues are discussed below. 
Note that both the analysis and synthesis can be implemented with fast Fourier 
methods. 


Residual analysis. Analysis for the ERB residual model can be carried out 
using the short-time Fourier transform [83]. As in the sinusoidal analysis, a slid- 
ing window is used to extract time-domain frames, and each frame is analyzed 
with the discrete Fourier transform. Specifically, a sliding window w[n — iL] of 
length N is used to isolate frames of the residual s[n] at times spaced by the 
analysis hop size L. The frame signal w[n — iL]s[n] is then transformed into 
the short-time spectrum S(k,i) by a DFT of size K, where K > N. The values 
of N, K, and L need not correspond to those used in the sinusoidal analysis. 

After the DFT, the spectrum is simply divided into bands according to the 
ERB model; without degradation of the model, the bandwidths of the ERBs 
can be scaled by a common factor to cover the spectrum with fewer bands and 
thereby achieve data compression. After the band allocation is established, the 
energy in each of the bands is computed from the DFT magnitudes; since the 
spectrum is conjugate symmetric, the negative frequency components are not 
included: 


$$\tilde{E}_r(i) = \frac{1}{K} \sum_{k \in \mathcal{G}_r} |S(k, i)|^2, \tag{4.40}$$


where $\mathcal{G}_r$ denotes the bins that fall in the r-th ERB; this shorthand will be 
used throughout the chapter. In this DFT-based analysis, these energies serve 
as the residual parameters for the i-th frame; changes in the characteristics of 
the residual are reflected in frame-to-frame variations of the ERB energies. 

The energies $\tilde{E}_r(i)$ are not entirely the same as the $E_r(i)$ formulated in the 
filter bank analysis, but both energy measures $\tilde{E}_r(i)$ and $E_r(i)$ are conceptually 
suitable for the psychoacoustic model, namely that the perceptual 
qualities of broadband noise are determined by the total energy in each band 
and not by the specific distribution of energy within the bands. The distinction 
between $\tilde{E}_r(i)$ and $E_r(i)$ is discussed further later. Note that the phase of S(k, i) 
is irrelevant to the ERB energy calculation, which is justified since the auditory 
system is primarily sensitive to the magnitude of the short-time spectrum. This 
insensitivity to phase is especially applicable to the case of broadband noise, 
where the phase is itself a stochastic process; in such cases, the percept is 
basically independent of the phase distribution. 
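
The DFT-based analysis of a single frame can be sketched as follows. The band edges are passed in as an argument, since the specific ERB allocation is not fixed here; the 1/K scaling, the handling of bins that fall exactly on a band edge, and the function name are assumptions made for the example.

```python
import numpy as np

def dft_band_energies(frame, window, K, band_edges_hz, fs):
    """Band energies of one windowed frame from positive-frequency DFT bins."""
    spectrum = np.fft.rfft(window * frame, n=K)  # bins 0 .. K/2 only
    bin_freqs = np.arange(len(spectrum)) * fs / K
    num_bands = len(band_edges_hz) - 1
    # Band r collects bins with edges[r] <= f < edges[r + 1]; the final band
    # also takes the Nyquist bin.
    band = np.searchsorted(band_edges_hz, bin_freqs, side="right") - 1
    band = np.clip(band, 0, num_bands - 1)
    power = np.abs(spectrum) ** 2
    return np.bincount(band, weights=power, minlength=num_bands) / K
```

Only the squared magnitudes enter the computation, so the phase of the short-time spectrum is discarded at this stage, in keeping with the discussion above.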


Residual synthesis. The modeled residual can be synthesized with an IDFT 
followed by OLA. For a given frame, the ERB energies are converted into a 
piecewise constant spectrum wherein the magnitude of each constant piece is 
determined by the corresponding ERB analysis parameter; these magnitudes 
correspond to the gains of the time-domain filter bank model. An example of 
this is given in Figure 4.8, which shows the magnitude spectrum of an analysis 
frame and the corresponding piecewise constant spectral estimate for synthesis 
Figure 4.8. Piecewise constant ERB estimate (solid) of the residual magnitude spectrum 
(dotted) for a frame of a breathy saxophone note. 


based on twelve ERBs. Synthesis using piecewise linear spectral estimates, 
sloped within each ERB to fit the analysis spectrum, gives a reconstruction of 
the same perceptual quality as the piecewise constant approach, which verifies 
the assertion that the ear is not sensitive to the specific spectral distribution 
within each ERB. 

For the sake of input-output equalization, it is important to preserve the 
ERB energies in the analysis-synthesis pathway; this is accounted for in the 
following equations, where $S(k,i)$ denotes the analysis DFT for the i-th frame,
$\hat{S}(\kappa,i)$ denotes the piecewise constant spectral estimate derived in the synthesis,
$M_r$ is the number of bins in the r-th ERB at the synthesis stage, and $M$ is the
size of the synthesis IDFT. Note that the analysis transform and the synthesis
transform do not have to be the same size. Accordingly, the bins $\beta'_r$ in the r-th
synthesis band are not necessarily the same as the bins $\beta_r$ in the r-th analysis
band; also, note that the following formula uses distinct bin indices $k$ and $\kappa$:

$$\tilde{E}_r(i) = \frac{1}{M} \sum_{\kappa \in \beta'_r} |\hat{S}(\kappa,i)|^2 = \frac{1}{K} \sum_{k \in \beta_r} |S(k,i)|^2. \tag{4.41}$$

In the spectral estimate, every bin in a given synthesis band takes on the same
value; for any $\kappa \in \beta'_r$, the above equation can thus be rewritten as:

$$\tilde{E}_r(i) = \frac{M_r}{M} |\hat{S}(\kappa,i)|^2 \;\Longrightarrow\; |\hat{S}(\kappa,i)| = \sqrt{\frac{M}{M_r}\, \tilde{E}_r(i)}. \tag{4.42}$$


Energy normalization will be considered further in a later section. 

After the magnitude spectrum is constructed, a uniform random phase is 
applied on a bin-by-bin basis. Frame-to-frame phase correlations can be introduced
to control the texture of the synthesized residual; for instance, varying
the smoothness of the residual may be musically desirable. After the phase
is incorporated, the spectrum of the residual model and the partial spectrum 
are summed (in rectangular coordinates) and transformed into a time-domain 
signal by the IDFT and OLA. This approach has proven perceptually viable 
for broadband residuals such as saxophone and flute breath noise. 
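
As a rough illustration of the synthesis step, the following Python sketch builds a piecewise constant spectrum with uniform random phase from a set of band energies. It assumes that each bin magnitude in a band is set to the square root of $M$ times the band energy divided by the number of synthesis bins in the band, so that the band energy is preserved; the function name, the `band_edges` layout, and the `rng` argument are hypothetical, and the conjugate-symmetric negative-frequency half needed for a real output is deliberately left out.

```python
import cmath
import math
import random


def synth_spectrum(energies, band_edges, M, rng=random):
    """Piecewise constant spectrum with uniform random phase, bin by bin.

    energies[r] is the energy of band r, which covers the synthesis bins
    band_edges[r] <= kappa < band_edges[r + 1]. Each magnitude is
    sqrt(M * E_r / M_r), where M_r is the number of bins in the band, so that
    (1/M) * sum over the band of |S_hat|^2 equals E_r.
    """
    S_hat = [0j] * M
    for r, E in enumerate(energies):
        lo, hi = band_edges[r], band_edges[r + 1]
        mag = math.sqrt(M * E / (hi - lo))
        for kappa in range(lo, hi):
            S_hat[kappa] = cmath.rect(mag, rng.uniform(-math.pi, math.pi))
    return S_hat
```

Passing a seeded `random.Random` instance as `rng` makes the phases reproducible, which is useful when comparing different normalizations by ear.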


Comparison of DFT and filter bank analysis-synthesis methods. While 
founded on the same psychoacoustic principle, the DFT-based model of the 
residual discussed in this section and the filter bank formulation of Section 
4.2.2 provide different ERB energies for the model. Perceptually, the two meth- 
ods yield similar results; a mathematical comparison, however, shows that the 
residual models are indeed different. 

The difference between the two methods can be formalized using the short- 
time Fourier transform. Some restrictions must be imposed to compare the 
methods; these will be introduced as the framework is developed. It should be 
noted that the difference between the ERB parameters depends on the analysis, 
so the synthesis filter bank will not enter the discussion. 

In the DFT method, the analysis with the sliding window w[n] can be im- 
mediately interpreted as a modulated STFT filter bank of the form shown in 
Figure 2.3, with analysis filters given by $w[-n]e^{j\omega_k n}$. From Section 2.2.1, the
STFT of $s[n]$ with subsampling by $L$ is given by

$$S(k,i) = \sum_{n=0}^{N-1} w[n]\, s[n+iL]\, e^{-j\omega_k n}, \tag{4.43}$$

and the ERB parameters in the DFT method, as defined earlier, are given by

$$\tilde{E}_r(i) = \frac{1}{K} \sum_{k \in \beta_r} |S(k,i)|^2. \tag{4.44}$$


Then, summing the band energies across the spectrum yields the signal energy 
of Parseval’s theorem: 


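$$\sum_{r=1}^{R} \tilde{E}_r(i) = \sum_{r=1}^{R} \frac{1}{K} \sum_{k \in \beta_r} |S(k,i)|^2 \tag{4.45}$$

$$= \frac{1}{K} \sum_{k=0}^{K-1} |S(k,i)|^2 = \sum_{n=0}^{N-1} \big| w[n]\, s[n+iL] \big|^2. \tag{4.46}$$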


As will be seen, such a summation does not generally apply in the filter bank 
case; the sum of the subband energies is not proportional to the energy of the 
input signal unless the filter bank corresponds to a tight frame [238]. 

In considering the filter bank approach, various restrictions must be imposed 
to allow for a meaningful comparison with the DFT method. First, the filters 
are restricted to be of the form 


$$h_r[n] = f[n]\, b_r[n]\, e^{j\omega_r n}, \tag{4.47}$$

where $f[n]$ is a window function, $b_r[n]$ is an ideal filter, and $\omega_r = 2\pi k_r/K$,
a bin frequency of a K-point DFT. Unlike earlier, these filters are defined to 
be complex; this allows for straightforward comparisons to the complex STFT 
filter bank. For real filters, a scale factor of two is necessary in some of the 
calculations to account for the negative frequency components. 

Given the above restriction on $\omega_r$, $b_r[n]$ can be written as

$$b_r[n] = \sum_{k=-\epsilon_r}^{\epsilon_r} b[n]\, e^{j2\pi kn/K}, \tag{4.48}$$

where

$$b[n] = \frac{\sin(\pi n/K)}{\pi n}, \tag{4.49}$$

which corresponds in the frequency domain to an ideal filter of bandwidth
$2\pi/K$, which is the width of one bin in a K-point DFT. The sum of modulated
sinc functions in Equation (4.48) is then just an ideal filter of bandwidth
$2\pi(2\epsilon_r + 1)/K$.

The last required restriction in this filter bank consideration is that the 
window function w[n] must satisfy 


$$w[n] = f[-n]\, b[-n], \tag{4.50}$$

meaning that the DFT analysis window $w[n]$ must be a windowed and time-reversed
version of the impulse response of a narrowband sinc function. If
this condition is met, it follows that any filter in the nonuniform filter bank
corresponds to a sum of adjacent STFT filters:

$$\begin{aligned}
h_r[n] &= f[n]\, b_r[n]\, e^{j\omega_r n} && (4.51) \\
&= f[n]\, e^{j\omega_r n} \sum_{k=-\epsilon_r}^{\epsilon_r} b[n]\, e^{j2\pi kn/K} && (4.52) \\
&= \sum_{k=k_r-\epsilon_r}^{k_r+\epsilon_r} f[n]\, b[n]\, e^{j2\pi kn/K} && (4.53) \\
&= \sum_{k=k_r-\epsilon_r}^{k_r+\epsilon_r} w[-n]\, e^{j2\pi kn/K}. && (4.54)
\end{aligned}$$


The r-th subband signal of the ERB filter bank thus corresponds simply to a 
sum of the outputs of the STFT filters in the band: 


$$s_r[n] = \sum_{k \in \beta_r} S[k,n], \tag{4.55}$$

where the STFT $S[k,n]$ is not subsampled. The ERB energies in the filter bank
approach are then given by

$$E_r(i) = \sum_{n=iL}^{iL+N-1} \Big| \sum_{k \in \beta_r} S[k,n] \Big|^2. \tag{4.56}$$

In this case, the sum across bands does not yield the same result as the DFT
method. This disparity occurs because of the nonlinearity of the magnitude 
function; in the DFT method, the magnitude is taken before the subband 
signals are summed; in the filter bank method, the magnitude is taken at a 
different point, namely after the subbands are added together. 
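
The difference in the order of operations can be made concrete with a small numerical sketch. The Python below evaluates both energy measures directly from their definitions: magnitude-squared per bin and then summed (DFT method, with the $1/K$ factor), versus bins summed and then magnitude-squared over $N$ consecutive time indices (filter bank method). The function names are hypothetical, and the example does not enforce the window restriction of Equation (4.50); it is meant only to show that the two quantities are computed differently.

```python
import cmath
import math


def stft_bin(s, w, k, K, n):
    """S[k, n] = sum_m w[m] * s[n + m] * exp(-j * 2 * pi * k * m / K)."""
    return sum(w[m] * s[n + m] * cmath.exp(-2j * math.pi * k * m / K)
               for m in range(len(w)))


def dft_band_energy(s, w, band, K, i, L):
    """DFT method: squared magnitude per bin at time i*L, summed over the band."""
    return sum(abs(stft_bin(s, w, k, K, i * L)) ** 2 for k in band) / K


def fb_band_energy(s, w, band, K, i, L):
    """Filter bank method: bins summed first, squared magnitude accumulated
    over the N time indices n = i*L, ..., i*L + N - 1."""
    N = len(w)
    return sum(abs(sum(stft_bin(s, w, k, K, n) for k in band)) ** 2
               for n in range(i * L, i * L + N))
```

For a unit impulse `s = [0, 0, 0, 1, 0, 0, 0, 0]`, a rectangular window of length 4, `K = 4`, and the band `[1, 2]`, the first function gives 0.5 and the second gives 8.0 for frame `i = 0`; the two measures differ both in scale and in how cross terms between bins enter the sum.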

The DFT and filter bank methods are mathematically distinct as derived 
above. However, they exhibit some type of equivalence in that the perceptual 
merits of the models are similar. This equivalence, despite the formal differ- 
ence, indicates that a certain crudeness or inexactness can be incorporated into 
residual models without causing adverse effects; this is especially true if the 
inexactness is based on simple psychoacoustics. 


An aside on Parseval’s theorem. The filter bank residual model relies 
on the equivalence of time-domain and transform-domain signal energies; this 
equivalence is referred to as Parseval’s theorem or relation. Parseval’s theorem 
holds for any orthogonal basis, and a similar expression can be derived for the 
case of tight frames [238]. In this section, issues related to frequency-domain 
energies are considered. It should be noted that these issues are not intrinsically
coupled to the application of residual modeling, but indeed apply to any signals. 

The frequency-domain representations of interest here are the discrete-time 
Fourier transform and the discrete Fourier transform. Considering Parseval’s 
relation for these two cases leads to an interesting result. For a discrete-time 
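signal $x[n]$ of length $N$, the energy can be expressed in terms of the DTFT or
the DFT, which is the uniformly sampled DTFT as discussed in Section 2.5.1:

$$\begin{aligned}
\sum_{n=0}^{N-1} |x[n]|^2 &= \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(e^{j\omega})|^2\, d\omega && \text{DTFT} && (4.57) \\
&= \frac{1}{K} \sum_{k=0}^{K-1} |X[k]|^2 && \text{DFT with } K \geq N && (4.58) \\
&= \frac{1}{K} \sum_{k=0}^{K-1} |X(e^{j2\pi k/K})|^2. && && (4.59)
\end{aligned}$$

The right-hand expressions in Equations (4.57) and (4.59) can be equated and
manipulated into the form

$$\int_{-\pi}^{\pi} |X(e^{j\omega})|^2\, d\omega = \frac{2\pi}{K} \sum_{k=0}^{K-1} |X[k]|^2. \tag{4.60}$$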


The left side is simply the integral of the magnitude-squared of the DTFT. The
right side can be interpreted as a piecewise approximation of the continuous
integral; the width of a piece is $2\pi/K$ and the height of the piece spanning
from frequency $2\pi k/K$ to $2\pi(k+1)/K$ is $|X[k]|^2$, the squared magnitude of the
DTFT sample at $2\pi k/K$. This stepwise integration is illustrated in Figure 4.9.
For the squared magnitude of the DTFT, there is no error in approximating
the integral in this fashion as long as the DFT is large enough, i.e. there are
enough samples of the DTFT. Essentially, this condition holds because the
signal is time-limited; the notion is analogous to the familiar result that a
bandlimited signal can be perfectly reconstructed from an appropriate set of
samples [216]. This issue is mostly an aside from the discussion of residual
modeling, so it will not be considered further.

Figure 4.9. Examples of exact stepwise integration for two spectra. As shown in Equation
(4.60), Parseval's theorem indicates that the stepwise approximation of the DTFT squared-
magnitude based on the DFT is exact if the DFT is large enough that no time-domain
aliasing is introduced. (Axes: frequency in radians; squared magnitude, linear scale.)
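
The exactness of the stepwise sum, and its failure when the DFT is too short, can be checked numerically. The following Python sketch samples the DTFT of a short sequence at $K$ uniformly spaced frequencies and forms the sum $(1/K)\sum_k |X(e^{j2\pi k/K})|^2$; the function names are hypothetical and the direct summation is chosen for clarity rather than speed.

```python
import cmath
import math


def dtft_samples(x, K):
    """K uniform samples of the DTFT of x at the frequencies 2*pi*k/K (any K)."""
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / K) for n in range(len(x)))
            for k in range(K)]


def stepwise_energy(x, K):
    """(1/K) * sum_k |X(e^{j 2 pi k / K})|^2, the stepwise Parseval sum."""
    return sum(abs(X) ** 2 for X in dtft_samples(x, K)) / K
```

For `x = [1, 2, 3, 4]`, whose energy is 30, the stepwise sum equals 30 for `K = 4` and for `K = 7`. For `K = 2` it equals 52, which is the energy of the time-aliased sequence `[1 + 3, 2 + 4]`; this is the time-domain aliasing that the condition $K \geq N$ rules out.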


Normalization. To achieve perceptual losslessness in a deterministic-plus- 
stochastic or reconstruction-plus-residual model, it is necessary that the rela- 
tive perceptual strengths of the two components be preserved by the system. 
The spectral peak picking described in Chapter 2 provides the proper ampli- 
tudes for equalized sinusoidal synthesis. In the residual model, the loudness 
equalization is based on preserving the short-time energy of the signal (in 
a stochastic sense); such energy preservation was the basis for deriving the 
short-time gains for the synthesis filter bank discussed in Section 4.2.2. In the 
frequency-domain synthesizer, the various operations mandate careful consid- 
erations of their effects on the short-time signal energy. Relative equalization 
of the subband energies is straightforward; however, the various windowing and 
overlap-add operations introduce gain changes that must be compensated for. 

The DFT-based residual analysis-synthesis is depicted in Figure 4.10. With
the exception of the transparency requirement, the ERB energy parameters
in this model meet the criteria discussed in Section 4.2.3; namely, the ERB
energies comprise a small set of perceptually meaningful parameters that can
be readily combined with the partials before the IDFT. To meet the final requirement
of perceptual losslessness, signal scaling must be accounted for; the
multiple windowing steps, as well as differences in the analysis and synthesis
frame sizes and sampling rates, affect the loudness of the synthesized residual
in the DFT-based model, so the reconstruction must be equalized to match the
loudness of the original residual.

Figure 4.10. Block diagram of the DFT-based residual analysis-synthesis. The first three
blocks constitute the analysis.

The proper equalization of the residual can be derived by considering the 
energy in the continuous-time signal. For an input segment of length $\tau_a$, corresponding
to $N$ samples at the rate $f_a = 1/T_a$, the continuous-time signal has
energy

$$E'_a = \int_{t_0}^{t_0+\tau_a} s(t)^2\, dt \approx \sum_{n=0}^{N-1} s(t_0 + nT_a)^2\, T_a, \tag{4.61}$$

where the $\approx$ refers to the approximation of the integral by the sum of the areas
of rectangles of width $T_a$ and height $s(t_0 + nT_a)^2$. The subscript $a$ refers to the
analysis stage; later, the subscript $s$ will be used to refer to the synthesis stage.
In discrete time, the energy of an analysis frame of length $N$ is

$$E_a = \sum_{n=0}^{N-1} s[n]^2 = \frac{E'_a}{T_a}. \tag{4.62}$$

The expected value of this energy, which will be used as an energy estimate, is
simply given by

$$E\{E_a\} = \sum_{n=0}^{N-1} E\{s[n]^2\} = N\, E\{s[n]^2\}. \tag{4.63}$$

This frame energy is now traced through the system; note that, as before, the
frame index is dropped without loss of generality.
First, the output $w[n]s[n]$ of the analysis window has energy

$$E_w = \sum_{n=0}^{N-1} w[n]^2 s[n]^2. \tag{4.64}$$

As in the derivation for the time-domain filter bank, the expected value of
the energy is now used as a metric; replacing $s[n]^2$ by its expected value in
Equations (4.62) and (4.64) gives

$$E_w = E\{s[n]^2\} \sum_{n=0}^{N-1} w[n]^2 = \frac{E_a}{N} \sum_{n=0}^{N-1} w[n]^2, \tag{4.65}$$



which indicates how the windowing process affects the signal energy. By Par- 
seval’s theorem, the K-point analysis DFT preserves this energy measure, as 
does the ERB energy estimation, by construction; the M-point IDFT likewise 
preserves the energy as long as the spectrum is constructed according to Equa- 
tion (4.42). Using a similar argument as for the analysis window, the effect of 
OLA with the length-M window v[n] can be shown to be 


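$$E_s = \frac{2E_w}{M} \sum_{n=0}^{M-1} v[n] \big( v[n] + v_1[n] + v_2[n] \big), \tag{4.66}$$

where $v_1[n]$ and $v_2[n]$ are the second half of the window from the previous frame
and the first half of the window from the subsequent frame, respectively:

$$v_1[n] = \begin{cases} v\!\left[n + \frac{M}{2}\right], & 0 \leq n < \frac{M}{2} \\ 0, & \frac{M}{2} \leq n < M \end{cases} \tag{4.67}$$

$$v_2[n] = \begin{cases} 0, & 0 \leq n < \frac{M}{2} \\ v\!\left[n - \frac{M}{2}\right], & \frac{M}{2} \leq n < M. \end{cases} \tag{4.68}$$
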

Note that a 50% overlap factor has been assumed in the derivation and that 
for a window v[n] that overlap-adds to one, the post-windowing and OLA do 
not affect the energy. In the IDFT/OLA synthesizer, the ERB spectrum is 
added to the partial spectrum before the IDFT, so the effective OLA window 
for the residual is a triangular window divided by the motif window. This 
hybrid window does not overlap-add to one, so the OLA scale factor must be 
taken into account. 

The energy $E_s$ given by Equation (4.66) is the discrete-time energy for a
synthesis frame of length $M$. The energy of the continuous-time output $\hat{s}(t)$ is

$$E'_s = \int_{t_0}^{t_0+\tau_s} \hat{s}(t)^2\, dt \approx \sum_{n=0}^{M-1} \hat{s}(t_0 + nT_s)^2\, T_s = E_s T_s, \tag{4.69}$$

where $T_s$ is the synthesis sampling period and $\tau_s$ is the duration of the M-sample
output frame: $\tau_s = MT_s$. However, since the input energy corresponds to an
input segment of duration $\tau_a$, what is required is an equalization of the energy
for an output segment of that same duration $\tau_a$. Letting $O$ be the number of
samples (at rate $1/T_s$) in such an output segment, the output energy is

$$E''_s = \sum_{n=0}^{O-1} \hat{s}[n]^2\, T_s = \frac{O}{M} E'_s = \frac{\tau_a}{\tau_s} E'_s. \tag{4.70}$$



The entire transformation of the continuous time energies is then given by 
$$E''_s = G_a G_s E'_a, \tag{4.71}$$

where the analysis and synthesis scaling factors are

$$G_a = \sum_{n=0}^{N-1} w[n]^2 \tag{4.72}$$

$$G_s = \frac{2}{M^2} \sum_{m=0}^{M-1} v[m] \big( v[m] + v_1[m] + v_2[m] \big). \tag{4.73}$$

In the analysis, then, the signal should be multiplied by the scale factor $1/\sqrt{G_a}$
before the ERB energies are calculated; at the synthesis stage, the output
should be multiplied by $1/\sqrt{G_s}$ to equalize the energies. Listening tests have
verified that the signal energy of Parseval’s theorem is an accurate measure 
of the loudness of broadband noise and that the outlined approach provides 
input-output equalization in the ERB analysis-synthesis. 
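
The two scale factors are straightforward to evaluate for given windows. The Python sketch below assumes the forms $G_a = \sum_n w[n]^2$ and $G_s = (2/M^2)\sum_m v[m](v[m] + v_1[m] + v_2[m])$, with 50% overlap and an even synthesis window length $M$; the function names are hypothetical, and the sketch does not model the hybrid window that arises when the residual shares the IDFT with the partials.

```python
def analysis_gain(w):
    """G_a = sum_n w[n]^2 for the analysis window w."""
    return sum(x * x for x in w)


def synthesis_gain(v):
    """G_s = (2 / M^2) * sum_m v[m] * (v[m] + v1[m] + v2[m]) for 50% overlap.

    v1 is the second half of the window from the previous frame and v2 is
    the first half of the window from the next frame; M = len(v) is assumed
    to be even.
    """
    M = len(v)
    half = M // 2
    total = 0.0
    for m in range(M):
        v1 = v[m + half] if m < half else 0.0
        v2 = v[m - half] if m >= half else 0.0
        total += v[m] * (v[m] + v1 + v2)
    return 2.0 * total / M ** 2
```

As a consistency check, a rectangular analysis window of length 8 gives `analysis_gain` equal to 8, and a length-8 triangular synthesis window that overlap-adds to one gives `synthesis_gain` equal to 1/8, so the product is one and no equalization is needed in that case.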


4.4 CONCLUSION 


In modeling complicated signals, it is often necessary to introduce a mixture 
of representations. This chapter described the specific framework of residual 
modeling, in which the signal is first reconstructed based on a primary model, 
and the difference between the original and the reconstruction is then mod- 
eled independently. For the multiresolution sinusoidal model, this residual is 
a colored noise process that can be parameterized in a perceptually accurate 
fashion in terms of the subband energies of the auditory filters; in audio ap- 
plications, this process includes features that are perceptually important for 
realism, e.g. breath noise in a flute. This chapter discussed basic filter bank 
models of the auditory system as well as a simple approach for designing cor- 
responding filter banks. Two implementations of the resulting residual model 
were developed and compared. It was shown that the parameterizations in 
the two cases are somewhat different; experimentally, however, the difference 
is imperceptible, which suggests that achieving transparent reconstruction of 
noiselike components requires only very crude heuristic models. It should be 
noted that the residual model discussed in this chapter is not signal-adaptive; 
rather, it is intended for use in conjunction with signal-adaptive models that 
extract coherent signal features. Of course, the residual model could be made 
signal-adaptive by using a filter bank with adaptive band allocation, for in- 
stance, but such adaptation has not proven necessary for modeling typical 
residuals. 


5 PITCH-SYNCHRONOUS MODELS 


...the crazy web of wavelets makes sense 
seen from high above... 


— Gary Snyder, “Bubbs Creek Haircut” 


In the general sinusoidal model, the frequencies of the partials are estimated
without regard for the possibility of harmonic structure; at least, it is not necessary
to make any assumptions about the presence of such behavior. In cases
where harmonic structure is prevalent, i.e. in periodic and pseudo-periodic signals,
this can be exploited to improve the signal model with respect to data
reduction in that only the fundamental frequency need be recorded. In this 
chapter, a pitch-synchronous signal representation proposed in [54] is consid- 
ered; similar representations have been applied in prototype waveform speech 
coders [33, 116]. This pitch-dependent framework leads to simple sinusoidal 
models in which line tracking and peak detection are unnecessary because of 
the harmonic structure; furthermore, the representation leads to wavelet-based 
models that are more appropriate for pseudo-periodic signals than the lowpass- 
plus-details model of the standard discrete wavelet transform. By separately 
estimating the pitch or periodicity of a signal, improvements in both wavelet 
and sinusoidal models can be achieved. It should be noted that these approaches 
rely on robust pitch detection and thus apply only to signals whose periodic 
structure can be reliably estimated; in audio applications, then, appropriate 
signals consist of a single voice or a single instrument. 


5.1 PITCH ESTIMATION 


Pitch estimation or pitch detection refers to the problem of finding the basic 
repetitive time-domain structure within a signal. This issue has been explored 
most extensively in the speech and audio processing communities [104, 151, 190, 
219, 240]; the terminology is thus taken from these fields, but the methods apply 
to any pseudo-periodic signals. Pitch detection is reviewed in the section below;
the section thereafter proposes a simple algorithm for refining pitch estimates
for the purpose of carrying out pitch-synchronous signal segmentation. 


5.1.1 Pitch Detection Algorithms 


Algorithms for pitch detection can be loosely grouped into time-domain and 
frequency-domain methods. In frequency-domain approaches, a short-time 
spectrum of the signal is analyzed for harmonic behavior, i.e. peaks in the spectrum
at frequencies with a common factor; this factor corresponds to the fundamental
frequency of the signal. In time-domain techniques, cross-correlations
of nearby signal segments are computed at various lags; the lags that yield 
peaks in the cross-correlation correspond to the period of the signal. Both 
types of methods are fundamentally susceptible to errors: for instance, in the 
time domain, a two-period signal segment can be mistaken as a pitch period; 
in the frequency domain, dominance of either the odd or even harmonics, or a 
missing fundamental, can result in significant estimation errors. Various fixes 
have been proposed to account for these problems; for instance, based on the 
a priori knowledge that a typical musical signal does not have impulsive pitch
discontinuities, a median filter can be applied to the pitch estimates to remove 
outliers and provide a more robust estimate [151, 190, 219]. 
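
The median-filter fix can be illustrated with a short Python sketch. The function name and the default width are assumptions made for the example; the cited references should be consulted for the smoothing schemes actually used in practice.

```python
from statistics import median


def median_smooth(pitch_track, width=5):
    """Sliding median of odd width; near the edges only the available
    neighbours are used, so the output has the same length as the input."""
    half = width // 2
    return [median(pitch_track[max(0, i - half): i + half + 1])
            for i in range(len(pitch_track))]
```

For example, with the track `[100, 101, 200, 102, 103]` and `width=3`, the isolated estimate of 200 (a typical period-doubling error) is replaced by 102 while its neighbours are left essentially unchanged.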

For a more detailed discussion of pitch detection algorithms, the reader is 
referred to [104, 190, 219]. For the purposes of this chapter, it is assumed that a 
reliable pitch detection algorithm is available, and that the algorithm is capable 
of determining, perhaps according to some heuristic threshold, when no pitch 
can be reasonably assigned to the signal. Using this assessment, the algorithm
can segment the signal into regions classified as pitched or unpitched. 


5.1.2 Phase-Locked Pitch Detection 


A standard pitch detector provides an estimate of the local pitch of a signal, 
which is essentially a simple parametric description of the local behavior. A 
rough description of the local behavior is not entirely adequate, however, for the 
applications to be discussed here; as will be seen, it is important that the pitch 
estimates correspond to precise structures in the signal. To achieve this corre- 
spondence, pitch estimates from a standard algorithm can be “phase-locked” 
to the signal as proposed below. First, it is assumed that a robust pitch de- 
tector such as the one described in [151] is used to generate a moving estimate 
of the pitch period; the output of the pitch detector is specifically assumed to 
consist of pitch periods and their corresponding time indices. This pitch period 
function will be denoted by P(t); since detectors generally estimate the pitch 
at some fixed interval T, the function P(t) can be equivalently represented as 
$P(iT) = P(t)|_{t=iT}$. It is further assumed for the sake of notation that the pitch
detection algorithm assigns a value of zero to $P(t)$ when no reasonable pitch
can be assigned to the signal. Note that the onset of a signal cannot typically be
assigned a pitch, so $P(t) = 0$, or likewise $P(iT) = 0$, will generally be the case
in the onset regions; after the onset, if the signal becomes pseudo-periodic, a
pitch can be estimated. A similar observation holds for transitions, for instance
note-to-note changes in music; a pitch cannot be assigned to the interstitial re- 
gions. Given these assumptions and observations, the phase-locking algorithm 
is straightforward; it is explained here as well as in the flowchart of Figure 5.1: 


■ For the first pitch detected after a region where $P(iT) = 0$, find the corresponding
time point in the signal ($t_a$) and search for the first subsequent
positive-slope zero crossing in the signal. Denote this by $t_0$. Since time is
discretized and the zero crossing may not fall on a sample point, $t_0$ is chosen
to correspond to the first positive signal value after the zero crossing.


■ The time $t_0$ lies between two times $t_a$ and $t_b$ for which pitches have been estimated
by the initial pitch detection algorithm; thus, an appropriate estimate
of the pitch period at time $t_0$ can be found by interpolating:

$$P(t_0) = \frac{P(t_a)[t_b - t_0] + P(t_b)[t_0 - t_a]}{t_b - t_a}. \tag{5.1}$$

$P(t_0)$ is the estimated length of the signal period starting at $t_0$.


■ Find the positive-slope zero crossing closest to (not necessarily after) the
time $t_0 + P(t_0)$. Denote this time by $t_1$. Again, the time is rounded to
correspond to the positive value after the zero crossing.


■ Interpolate to estimate $P(t_1)$, and then find $t_2$, which is the time of the
closest positive-slope zero crossing to $t_1 + P(t_1)$.


■ Repeat the above step for $t_2$, and so on, until a region where $P(iT) = 0$ is
entered, at which point the algorithm should be restarted.


■ At stages in the interpolation when $P(t_a) \neq 0$ and $P(t_b) = 0$, the interpolated
pitch is assigned a zero value to prevent incongruous pitch estimates.


■ The time points $\{t_0, t_1, t_2, \ldots\}$ indicate pitch period boundaries that can be
used to construct a track of phase-locked period estimates $P(t_i) = t_{i+1} - t_i$.
The starting times of the pitch periods follow positive-slope zero crossings
by construction, so the first sample in any pitch period is positive and the
last sample is negative.
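
The steps above can be sketched in Python for a single pitched region. This is a simplified illustration under several assumptions: all times and periods are measured in samples, the coarse estimates `P` are spaced `T` samples apart with zeros marking unpitched frames, the interpolated period is set to zero whenever either neighbouring estimate is zero, and restarting after an unpitched region is left to the caller. The function names are hypothetical.

```python
def pszc_indices(s):
    """Indices of the first positive sample after each positive-slope zero crossing."""
    return [n for n in range(1, len(s)) if s[n - 1] <= 0 < s[n]]


def interp_period(P, T, t):
    """Linearly interpolate the coarse period estimates P (hop T) at time t.

    Returns 0 if either neighbouring estimate is zero or missing.
    """
    a = t // T
    if a + 1 >= len(P) or P[a] == 0 or P[a + 1] == 0:
        return 0.0
    ta, tb = a * T, (a + 1) * T
    return (P[a] * (tb - t) + P[a + 1] * (t - ta)) / (tb - ta)


def lock_region(s, P, T, start_frame):
    """Phase-locked period boundaries for the pitched region that begins at
    coarse frame start_frame; stops when the interpolated period is zero."""
    zc = pszc_indices(s)
    later = [n for n in zc if n >= start_frame * T]
    if not later:
        return []
    marks = [later[0]]
    while True:
        t = marks[-1]
        period = interp_period(P, T, t)
        ahead = [n for n in zc if n > t]
        if period == 0 or not ahead:
            break
        # closest zero crossing to t + period, not necessarily after it
        marks.append(min(ahead, key=lambda n: abs(n - (t + period))))
    return marks
```

The differences between consecutive entries of the returned list are the phase-locked period estimates; for a sinusoid with a period of 20 samples and slightly inaccurate coarse estimates, the boundaries fall 20 samples apart.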


This phase-locking algorithm yields a set of refined pitch period estimates that 
correspond to pseudo-periodic structures that are synchronized to positive- 
slope zero crossings of the signal; as will be seen, synchronization at zero cross- 
ings, while seemingly arbitrary, is of importance for deriving a useful pitch- 
synchronous signal representation. Furthermore, it has also been reported that 
zero crossings are of physical significance in speech signals in that they are 
linked to instances when the glottis is closed [240]. 

Some wavelet-based algorithms for pitch estimation based on zero crossings 
have been discussed in the literature [140, 240]; the corrective phase-locking 
described above is adhered to in this treatment, however, since it is simple and 
allows for a quick synchronization of pitch period estimates to zero crossings in 
the signal.

Figure 5.1. Flow chart for phase-locked pitch detection. The abbreviation PSZC refers to
a positive-slope zero crossing in the signal. It is assumed that initial pitch period estimates,
denoted by $P(iT)$, are derived by a standard pitch detection algorithm such as the one
described in [151]. Additional details are given in the text. (Recoverable box labels: save
$[t_0, P(t_0) = 0]$; $t_1$ = PSZC closest to $t_0 + P(t_0)$; save $[t_0, P(t_0) = t_1 - t_0]$.)


5.2 PITCH-SYNCHRONOUS SIGNAL REPRESENTATION 


Using the time points from the simple phase-locked pitch detector presented 
above, the signal can be divided into pseudo-periodic segments, i.e. pitch periods
that are synchronized to positive-slope zero crossings. This segmentation
leads to a pitch-synchronous representation similar to the one proposed in [54]; 
this representation will prove useful for signal modeling. 


5.2.1 Segmentation 


In Section 4.1, mixed models of signals were discussed; this motivated consid- 
ering the sinusoidal model in terms of a deterministic-plus-stochastic decompo- 
sition where the stochastic component accounted for signal features not well- 
represented by the sinusoidal model. The overall model mixture then consisted 
of slowly-varying sinusoids and broadband noise. A representation similar to 
the deterministic-plus-stochastic decomposition has been widely applied in lin- 
ear predictive coding of speech, where the speech is coded using a time-varying 
source-filter model [136, 190]. The filter is adapted in time to match the speech 
spectrum, while the source is chosen based on a classification of the local speech 
signal as voiced or unvoiced. The characterization voiced refers to sounds that 
exhibit a strong periodicity, such as vowels; the corresponding source for the 
LPC model is a periodic impulse train. The alternative classification unvoiced 
designates sounds, for example sibilants and fricatives, which do not exhibit pe- 
riodic behavior and are heuristically more “noiselike”; the source for unvoiced 
sounds is typically white noise. Synthesis in the LPC framework is carried out 
by applying the appropriate source to the time-varying filter; when the input
is a periodic impulse train, the output has the pseudo-periodic structure characteristic
of voiced sounds, whereas when the input is white noise, the output
is simply colored noise and does not exhibit periodicities. 


In LPC, the voiced/unvoiced classification parameter indicates a segmenta- 
tion of the signal into regions where different models are appropriate. A simi- 
lar segmentation can be applied to arbitrary audio signals; because the terms 
“voiced” and “unvoiced” are inappropriate designations for musical signals, the 
terms “pitched” and “unpitched” will be used to classify the signal behavior. 
The phase-locked pitch detection algorithm described in the previous section 
is appropriate for deriving such a pitched/unpitched signal segmentation; re- 
gions where a pitch can be estimated are designated as pitched and regions 
where P(t) = 0 are classified as unpitched. This segmentation is markedly 
different from the deterministic-plus-stochastic decomposition described in the 
treatment of the sinusoidal model; as discussed in Section 4.1, in the sinusoidal 
model and in some LPC variations, the model mixtures are concurrent in time. 
For pitch-synchronous processing, however, it is basically necessary to neglect 
such concurrency and rigidly segment the signal into pitched and unpitched 
regions. As will be seen, this segmentation introduces some difficulties in the 
modeling of transient regions, but these difficulties are not insurmountable. 


In segmenting a dynamic signal such as a musical phrase, the transitions 
between regions of different pitch are classified as unpitched; pitch-synchronous 
processing algorithms are adjusted at these transitions to account for the pitch 
change. In addition to the variations across transitions, the pitch of a natural 
signal typically exhibits variations in each local pitch region. In signals with 
vibrato, these local pitch variations are clearly perceptible; such variations, 
however, are also generally present when a vibrato effect is not perceptible. 
Since the algorithms to be discussed require a uniform local pitch, it is necessary 
to remove these variations prior to processing. This is carried out by first 
segmenting the input signal into pitch periods within each local pitch region; 
these pitch period segments do not each have the same duration. Then, the 
pitch fluctuations are removed by adjusting the segments to have the same 
duration. As described in the next section, this adjustment can be performed by 
resampling. Note that the pitch variations can be reintroduced in the synthesis 
stage if necessary for realism. 


5.2.2 Resampling 


In general digital audio applications, it is often desirable to change the sam- 
pling rate; this can be done by converting the signal to continuous time and 
then sampling at the desired rate, but this approach is both inefficient and not 
robust to noise degradations. It is thus of interest to effect a change in sampling 
rate in the digital domain. This process is referred to as sample rate conversion 
or resampling. In the algorithms in this chapter, resampling will be used to re- 
move local pitch variations prior to carrying out pitch-synchronous processing. 
Removing the pitch fluctuations enables construction of the pitch-synchronous
signal representation discussed in the next section, which will prove useful for
coding and modification. 

One method of resampling uses the familiar upsampling and downsampling 
operations. Changing the sampling rate of a sequence $x[n]$ from $f_s$ to $\frac{P}{Q} f_s$
is carried out by upsampling by $P$ and then downsampling by $Q$, with some
appropriate intermediate filtering to prevent aliasing [165]. The resulting sequence
is $\frac{P}{Q}$ times as long as $x[n]$. A detailed consideration of this type of
approach can be found in [218].

A primary difficulty with the filter-based approach to resampling is that 
it tends to introduce edge effects. This is problematic for the application of 
pitch period resampling since it leads to discontinuities at period boundaries in 
the pitch-synchronous signal models to be presented. An alternative method 
based on the discrete Fourier transform is more appropriate for this resampling 
application since it introduces fewer edge artifacts. 

Resampling using the DFT is carried out as follows [123]. Given a local pitch
region consisting of $I$ pitch period segments, where the i-th segment is denoted
by $x_i[n]$ and its original period is denoted by $Q_i$, the goal is to simply take
these $I$ segments and resample each one to have period $P$. For a pitch period
$x_i[n]$ of length $Q_i$, the first step is to compute a DFT of size $Q_i$, unless of course
$Q_i$ is equal to the target period $P$. The DFT spectrum is then truncated or
extended to size $P$ and an IDFT of size $P$ scaled by $P/Q_i$ is used to generate the
output sequence $x'_i[n]$ of length $P$; this resized spectrum is derived differently
depending on the relative values of $P$ and $Q_i$:


• $P = Q_i$. No resampling is necessary. Since this is computationally 
advantageous, the target period $P$ for a local pitch region is chosen as the mode of 
the original periods $\{Q_i, i \in [1, I]\}$ so that this case occurs frequently. 


• $P < Q_i$. The resampled output is to be shorter than the input, so the 
modified spectrum should have fewer bins than the original. This is carried 
out by discarding the $Q_i - P$ highest frequency bins, which is equivalent to 
eliminating the highest frequency harmonics from the signal. 


• $P > Q_i$. The resampled output is to be longer than the input, so the 
modified spectrum should have more bins than the original. This is done by 
introducing $P - Q_i$ high-frequency harmonics having either zero amplitude 
or nonzero amplitudes derived by extrapolating the original spectrum. 


Note that the Nyquist frequency bin, if present (when $P$ or $Q_i$ is even), is always 
zeroed out. Also note that since the sampling rate is necessarily large in 
high-quality audio applications, the periods $P$ and $Q_i$ are both typically fairly large. 
Since local pitch variations are typically small with respect to the average local 
pitch, the spectral adjustments described above are relatively minor. The DFT 
computation, however, may be intensive, especially if $P$ or $Q_i$ is prime. The cost 
is not prohibitive, however, since the algorithms to be discussed are intended 
primarily for off-line use. Further treatment of resampling is not merited here; 
from this point on, it is assumed that pitch variations can be reliably removed. 
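
The three cases above can be collected into one short routine. The sketch below assumes NumPy's real-input FFT and uses the zero-amplitude option for the added harmonics when $P > Q_i$; the function name `resample_period_dft` is hypothetical.

```python
import numpy as np

def resample_period_dft(x, P):
    """Resample one pitch period x (length Q) to length P via the DFT."""
    x = np.asarray(x, dtype=float)
    Q = len(x)
    if P == Q:
        return x.copy()          # no resampling necessary
    X = np.fft.rfft(x)           # bins 0 .. Q//2 of the size-Q DFT
    if Q % 2 == 0:
        X[-1] = 0.0              # zero the Nyquist bin of the input
    Y = np.zeros(P // 2 + 1, dtype=complex)
    m = min(len(X), len(Y))
    Y[:m] = X[:m]                # truncate (P < Q) or zero-extend (P > Q)
    if P % 2 == 0:
        Y[-1] = 0.0              # zero the Nyquist bin of the output
    # irfft applies a 1/P factor; P/Q compensates for the size-Q analysis.
    return np.fft.irfft(Y, n=P) * (P / Q)
```

A harmonic that completes $k$ cycles in $Q_i$ samples completes $k$ cycles in $P$ samples after this operation, with its amplitude and phase unchanged.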


5.2.3 The Pitch-Synchronous Representation Matrix 


Once the pitch variations in the $I$ pitch period segments have been removed 
via resampling, the signal can be reorganized into an $I \times P$ matrix 

$$X = \begin{bmatrix} x'_0[n] \\ x'_1[n] \\ x'_2[n] \\ \vdots \\ x'_{I-1}[n] \end{bmatrix}, \tag{5.2}$$

where $x'_i[n]$ is a version of the pitch period $x_i[n]$ that has been resampled to 
length $P$. The matrix will be referred to as the pitch-synchronous representation 
(PSR) of the signal. As described in the next section, this representation 
is useful for carrying out modifications; furthermore, structuring the signal 
in this fashion leads to the pitch-synchronous sinusoidal models and wavelet 
transforms discussed later. 
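
As a rough sketch of the construction, the PSR can be assembled by resampling each segment to the modal period and stacking the results as rows. The name `build_psr` is hypothetical, the input is assumed to be a list of already segmented pitch periods, and the Nyquist-bin zeroing described in Section 5.2.2 is omitted for brevity.

```python
import numpy as np
from collections import Counter

def build_psr(periods):
    """Stack variable-length pitch periods into an I x P matrix."""
    # Target period P: the mode of the original periods Q_i.
    P = Counter(len(p) for p in periods).most_common(1)[0][0]
    rows = []
    for p in periods:
        p = np.asarray(p, dtype=float)
        Q = len(p)
        if Q == P:
            rows.append(p)       # already at the target period
        else:
            # DFT-based resampling; irfft truncates or zero-pads to size P.
            rows.append(np.fft.irfft(np.fft.rfft(p), n=P) * (P / Q))
    return np.vstack(rows)
```

Each row of the returned matrix is one resampled pitch period $x'_i[n]$, so row $i$ and row $i+1$ can be compared sample by sample.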

There are several noteworthy issues regarding the PSR. First, the matrix 
need not be constructed via resampling. Alternatively, the period lengths can 
be equalized by zero padding all of the period signals to the maximum period 
length [54] or by viewing each period as an impulse response and carrying out an 
extension procedure such as in some pitch-synchronous overlap-add (PSOLA) 
methods [158]. These approaches, however, do not yield the same smoothness as 
resampling; they do not necessarily preserve the zero-crossing synchronization 
and discontinuities may result in the reconstruction. 

A second issue regarding the PSR concerns the unpitched regions. Each 
pitched region in a signal has a preceding unpitched region; this structure al- 
lows the approach to be readily generalized from the single note scenario to 
the case of musical phrases. Given this argument, the considerations herein are 
primarily limited to signals consisting of a single note. In the single-note case, 
the preceding attack is then the unpitched region in question. To allow for uni- 
form processing of the signal, the attack is split into segments of length P and 
included in the PSR; the beginning of a signal is zero padded so that the length 
of the onset is a multiple of P. In later sections, perfect reconstruction of the 
attacks is considered in the frameworks of both pitch-synchronous Fourier and 
wavelet models. In either of the transforms, the signal is reconstructed after 
processing by concatenating the rows of the synthesis PSR, possibly resam- 
pled to the original pitch periods using pitch side information if perceptually 
necessary. 

An example of a PSR matrix is given in Figure 5.2 for a portion of a bassoon 
note. This bassoon signal and variations of a similar synthetic signal will be 
used throughout the chapter to illustrate the issues at hand. Note that the 
PSR is immediately meaningful for signals consisting either of single notes or 
several simultaneous notes that are harmonically related. For modeling musical 
phrases or voice, however, it is necessary to generate a different PSR for each 
pitch region in the signal; the various PSR matrices have different dimensions 
depending on the local pitch and duration of that pitch. This chapter focuses 
on the single-pitch case without loss of generality; extensions of the algorithms 
are straightforward. 

Figure 5.2. A portion of a bassoon note and its pitch-synchronous representation. 


5.2.4 Granulation and Modification 


The pitch-synchronous representation is a granulation of the signal that can 
be readily used to facilitate several types of modification: time-scaling, pitch- 
shifting, and pitch-synchronous filtering. First, time-scaling can be carried 
out by deleting or repeating pitch period grains for time-scale compression 
or expansion, respectively; this can be done either in a structured fashion or 
pseudo-randomly. In speech processing and granular synthesis applications, 
similar techniques are referred to as deletion and repetition [10, 127]. Note that 
the time-scaling by deletion/repetition is accomplished without pitch-shifting, 
and that it is inherently made possible by the zero-crossing synchronization of 
the PSR; without this imposed smoothness of the model, discontinuities would 
result in the modified signal. 
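
Structured deletion and repetition amount to selecting rows of the PSR. The following sketch assumes the PSR is held as a NumPy array with one pitch period per row; the function names and the `factor` parameter are illustrative.

```python
import numpy as np

def time_scale_psr(X, factor):
    """Time-scale a PSR matrix X (I x P) by repeating or deleting rows."""
    I = X.shape[0]
    n_out = max(1, int(round(I * factor)))
    # Structured selection: output row j reuses input row floor(j / factor),
    # so rows repeat for factor > 1 and are skipped for factor < 1.
    idx = np.minimum((np.arange(n_out) / factor).astype(int), I - 1)
    return X[idx]

def psr_to_signal(X):
    """Concatenate the rows (pitch periods) into a one-dimensional signal."""
    return X.reshape(-1)
```

For example, `psr_to_signal(time_scale_psr(X, 2.0))` plays every pitch period twice, doubling the duration while leaving the period length, and hence the pitch, unchanged.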

Pitch-shifting based on the PSR is done simply by resampling the pitch 
periods. Such pitch-shifting is not formant-corrected; however, formant 
correction, which was discussed in Section 2.7.2, can be included by incorporating 
a model of the spectral envelope in the DFT-based resampling scheme 
described earlier. Also, this pitch-shifting changes the duration of the signal, 
so an accompanying deletion or repetition of the resampled pitch periods is 
required to preserve the original time scale. Finally, given a pitch period 
segmentation of the signal, the signal can be viewed as the output of a time-varying 
source-filter model where the source is a pitch periodic impulse train and the 
time-varying filter determines the shape of the pitch period grains. In this light, 
a second time-varying pitch-synchronous filter can be applied to the signal by 
convolution with the individual pitch periods; the signal is then reconstructed 
by overlap-add of the new period segments. This notion leads to some time- 
varying modifications as well as pitch-based cross-synthesis of multiple signals. 
As described in Section 2.7, signals with pitched behavior are well-suited for 
modification. The ease of modification based on the pitch-synchronous repre- 
sentation is thus not particularly surprising. For signal coding, on the other 
hand, the PSR is not immediately useful; however, it does expose redundancies 
in the signal that can be exploited by further processing to achieve a compact 
representation. Two such techniques are described in the following sections. 


5.3 PITCH-SYNCHRONOUS SINUSOIDAL MODELS 


The peak picking, line tracking, and phase interpolation problems in sinusoidal 
modeling can be resolved by applying Fourier methods to a resampled pitch- 
synchronous signal representation. The pitch-synchronous representation is 
itself a signal-adaptive parametric model of the signal; by constructing the 
PSR, the signal is cast into a form which enables a Fourier expansion to be 
used in an effective manner. 

Of course, it is commonplace to model periodic signals using a Fourier se- 
ries expansion; it indeed provides a compact representation for purely periodic 
signals. Here, the Fourier series approach is applied to pseudo-periodic signals 
on a period-by-period basis. 


5.3.1 Fourier Series Representations 


A detailed review of Fourier series methods is given in Appendix B; various 
connections between the DFT and expansions in terms of real sines and cosines 
are indicated there. The result that is of primary interest here is that a real 
signal of length P can be expressed as 


$$x[n] = \frac{X[0]}{P} + \frac{2}{P} \sum_{k} |X[k]| \cos(\omega_k n + \phi_k), \tag{5.3}$$


where $\omega_k = 2\pi k/P$, $|X[k]|$ and $\phi_k$ are respectively the magnitude and phase 
of the $k$-th bin of a size-$P$ DFT of $x[n]$, and $k$ ranges over the half spectrum 
$(0, P/2]$. Note that this magnitude-phase form resembles the sinusoidal model 
of Chapter 2. If the representation of Equation (5.3) is applied to the rows 
of a PSR, i.e. the pitch periods of a signal, the result is a pitch-synchronous 
sinusoidal model in which some of the difficulties of the general sinusoidal model 
are circumvented. The various simplifications arise because of the prior effort 
given to the process of pitch detection and signal segmentation. 
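
The magnitude-phase expansion of Equation (5.3) can be checked numerically. In the sketch below the function name is hypothetical, and the separate handling of the Nyquist bin for even $P$ (weight $1/P$ rather than $2/P$) is a detail not spelled out in the text above; it is covered by the Appendix B treatment and does not arise when the Nyquist bin has been zeroed.

```python
import numpy as np

def magnitude_phase_expansion(x):
    """Rebuild a real length-P signal from DFT magnitudes and phases."""
    x = np.asarray(x, dtype=float)
    P = len(x)
    X = np.fft.fft(x)
    n = np.arange(P)
    y = np.full(P, X[0].real / P)            # DC term X[0] / P
    for k in range(1, (P - 1) // 2 + 1):     # bins strictly below Nyquist
        w_k = 2 * np.pi * k / P
        y += (2 / P) * np.abs(X[k]) * np.cos(w_k * n + np.angle(X[k]))
    if P % 2 == 0:
        # Nyquist bin (even P only): real-valued, weighted by 1/P, not 2/P.
        y += (X[P // 2].real / P) * np.cos(np.pi * n)
    return y
```

Applied to any real sequence, the returned array should match the input to within rounding error, which is the sense in which the expansion achieves perfect reconstruction.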


5.3.2 Pitch-Synchronous Fourier Transforms 


Applying the Fourier series to the pitch-synchronous representation of a signal is 
equivalent to carrying out pitch-synchronous sinusoidal modeling. In this case, 
as explained below, the peak picking and line tracking problems are resolved 
by the pitch synchrony. 


Peak picking. The DFT of a pitch period samples the DTFT at the frequencies 
of the pitch harmonics, namely the frequencies $\omega_k = 2\pi k/P$ for a 
pitch period of length $P$. These frequencies correspond to the relevant partials 
for the sinusoidal model. With regard to the discussion of peak picking and 
spectral resolution in Section 2.3.1, taking the DFT of a pitch period in the 
PSR is analogous to using a rectangular analysis window that spans exactly one 
pitch period, which provides exact resolution of the harmonic components with- 
out spectral oversampling. In short, spectral peaks do not need to be sought 
out as in the general sinusoidal model; here, each of the spectral samples in the 
DFT corresponds directly to a partial of the signal model. Partials with small 
amplitude can be neglected in order to reduce the complexity of the model and 
the computation required for synthesis, but this may lead to discontinuities as 
discussed later. 
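
A small numerical experiment illustrates why no peak search is needed. The period length, harmonic numbers, amplitudes, and phases below are made-up values chosen only for illustration.

```python
import numpy as np

P = 64                                   # pitch period in samples (illustrative)
n = np.arange(P)
# Synthetic period with harmonics 1, 2 and 5 (made-up amplitudes and phases).
amps = {1: 1.0, 2: 0.5, 5: 0.25}
x = sum(a * np.cos(2 * np.pi * k * n / P + 0.3 * k) for k, a in amps.items())

X = np.fft.rfft(x)
# Bin k sits exactly on harmonic k, so 2|X[k]|/P reads off its amplitude.
est = 2 * np.abs(X) / P
harmonics = np.flatnonzero(est > 1e-9)

# A rectangular window that is not a whole number of periods smears each
# harmonic over many bins, which is what forces peak picking in general.
x_long = sum(a * np.cos(2 * np.pi * k * np.arange(P + 9) / P + 0.3 * k)
             for k, a in amps.items())
leaky_bins = np.count_nonzero(2 * np.abs(np.fft.rfft(x_long)) / (P + 9) > 1e-3)
```

In the pitch-synchronous case only the bins listed in `harmonics` are nonzero, whereas the mismatched window spreads energy over many more bins, as counted by `leaky_bins`.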


Line tracking. In the pitch-synchronous sinusoidal model, the simplification 
of peak picking in the Fourier spectrum is accompanied by a simplification of 
the line tracking process. Given that the original signal is pseudo-periodic and 
that pitch variations are removed by resampling, the resulting representation 
has a well-behaved harmonic structure. Indeed, no line tracking is necessary 
at all since the frequencies of the partials, which correspond to the DFT bin 
frequencies, are the same in every period. Note that this insight applies to the 
case of a single note with an onset. To generate tracks that persist across mul- 
tiple notes, it is necessary to either impose births and deaths in the transition 
regions or to carry out line tracking of the harmonics across the transitions. 


5.3.3 Pitch-Synchronous Synthesis 


Since it is a basis expansion, the Fourier series representation can achieve per- 
fect reconstruction. Synthesis using basis vectors, however, is not particularly 
flexible. A generalized synthesis can be formalized by expressing a pitch period 
in the magnitude-phase form of Equation (5.3) and then phrasing the synthe- 
sis as a sum-of-partials model. This framework is considered in the following 
sections. 


Synthesis using a bank of oscillators. For a pitch period $x_i[n]$ of length 
$P$, the perfect reconstruction magnitude-phase expression is given by 

$$x_i[n] = \frac{2}{P} \sum_{k} |X_i[k]| \cos(\omega_k n + \phi_{k,i}) \tag{5.4}$$

for $n \in [0, P-1]$ and $\omega_k = 2\pi k/P$, and where $i \in [0, I-1]$ is a synthesis frame 
index that corresponds to the PSR row index. The signal can be constructed 
by concatenating the pitch period synthesis frames: 

$$x[n] = \sum_{i} x_i[n] = \frac{2}{P} \sum_{i} \sum_{k} |X_i[k]| \cos(\omega_k n + \phi_{k,i}). \tag{5.5}$$

The pitch period $x_i[n]$ is the $i$-th synthesis frame, and $X_i[k]$ is the DFT of $x_i[n]$; 
note that it has been assumed that $X_i[0] = 0$. The segment $x_i[n]$ is supported 
on the interval $n \in [iP, iP + P - 1]$; $X_i[k]$ likewise corresponds to that time 
interval. More formally, this Fourier amplitude could be expressed as 

$$X_i[k] \left( u[n - iP] - u[n - (i+1)P] \right), \tag{5.6}$$

where $u[n]$ is the unit step function; the simpler but looser notation is adhered 
to in this treatment. 

While the same frequencies appear in each frame in the model of Equation 
(5.5), there are not necessarily actual partials that persist smoothly in time. 
Consider the contribution of the components at a single frequency $\omega_k$: 

$$p_k[n] = \frac{2}{P} \sum_{i} |X_i[k]| \cos(\omega_k n + \phi_{k,i}). \tag{5.7}$$


The phase terms are not necessarily the same in each frame, so for this single- 
frequency component the concatenation may have discontinuities at the frame 
boundaries. These discontinuities are eliminated in the full synthesis; their 
appearance in the constituent signals, however, indicates that if components 
are omitted to achieve compaction or if the phase is discarded, frame-rate dis- 
continuities will appear in the output. Because of these discontinuities, it is 
problematic to interpret the Fourier model in Equation (5.5) as a straightfor- 
ward sum of partials. 
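
The boundary behavior of a single-frequency component can be seen in a two-frame example. The frame length, harmonic index, and phase values below are arbitrary, and `boundary_jump` is a hypothetical helper that compares the step across the frame boundary with the largest ordinary sample-to-sample step.

```python
import numpy as np

P, k = 50, 3                         # frame length and harmonic index (made up)
n = np.arange(P)
w_k = 2 * np.pi * k / P

def frame(amplitude, phase):
    """One synthesis frame of the single-frequency component p_k[n]."""
    return amplitude * np.cos(w_k * n + phase)

# Same phase in both frames: the sinusoid continues across the boundary.
same = np.concatenate([frame(1.0, 0.4), frame(1.0, 0.4)])
# Different phases: the waveform jumps at the frame boundary n = P.
diff = np.concatenate([frame(1.0, 0.4), frame(1.0, 2.0)])

def boundary_jump(p):
    """Step at the boundary minus the largest ordinary sample-to-sample step."""
    steps = np.abs(np.diff(p))
    return steps[P - 1] - np.max(np.delete(steps, P - 1))
```

A positive value of `boundary_jump(diff)` indicates a step at the frame boundary larger than anything the sinusoid produces on its own, whereas `boundary_jump(same)` is not positive.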

The difficulty with phase discontinuities at the frame boundaries can be cir- 
cumvented by rephrasing the reconstruction as a sinusoidal synthesis using a 
bank of oscillators. Rather than relying on the standard Fourier basis func- 
tions, sinusoidal expansion functions that interpolate the amplitude and phase 
are generated such that the reconstruction indeed consists of evolving partials 
and not discrete Fourier atoms with boundary phase mismatches. This revision 
of the approach provides an example of the usefulness of a parametric model: 
in the approximate reconstructions of compact models, discontinuities occur at 
the frame boundaries if the Fourier basis is used for synthesis, but not if the 
synthesis is based on interpolation of the sinusoidal parameters; by construc- 
tion, the sinusoidal model is free from frame boundary discontinuities. Note 
however that this sinusoidal model, while it is perceptually accurate, does not 
carry out perfect reconstruction. 


Zero-phase sinusoidal modeling. In the standard sinusoidal model, the 
phase interpolation process in the synthesis stage is a high-complexity opera- 
tion; phase interpolation is one of the major obstacles in achieving real-time 
synthesis [69]. This difficulty can be avoided by taking advantage of the harmonic 
structure that the pitch-synchronous sinusoidal model exposes. 

In the pitch-synchronous sinusoidal synthesis discussed above, the phase of 
the harmonics is preserved. Phase interpolation from frame to frame is thus 
required, but this is problematic in two respects. First, it is computationally 
expensive. Second, the interpolation does not take into account a fundamental 
property of the representation, namely that the same frequencies are present 
in every frame; indeed, if the total phase of a partial is determined by fitting 
a cubic polynomial to the frequency and phase parameters in adjacent frames, 
the partial’s effective frequency will be time-varying, which is not desired in 
this pitch-synchronous algorithm. 

The time variation of the interpolated partial frequencies can be avoided 
in the following way. By construction, a Fourier sinusoid in a frame moves 
through an integral number of periods, meaning that its start and end phases 
are the same (one sample off, that is). Thus, for the corresponding sinusoid in 
the next frame to evolve continuously across the frame boundary, its starting 
phase should be one sample ahead of the end phase in the previous frame, or 
in other words it should simply be equal to the start phase from the previous 
frame. If this continuity is imposed, there is no phase interpolation required in 
the synthesis; a harmonic partial simply has the same phase in every frame. 

This method is referred to here as zero-phase sinusoidal modeling since the 
start phases in the first frame can all be set to zero; then, the start phase for 
every partial in every frame is zero. In some cases, though, it may be useful 
to preserve the phase in the first frame to ensure perfect reconstruction there; 
this technique can be used to reconstruct attacks without the delocalization 
incurred in the general sinusoidal model. This initial phase is then fixed as the 
start phase for all frames, so the signal reconstruction can be expressed as 


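$$\begin{aligned}
x[n] = \sum_{i} x_i[n] &= \frac{2}{P} \sum_{i} \sum_{k} |X_i[k]| \cos(\omega_k n + \phi_{k,0}) && (5.8) \\
&= \frac{2}{P} \sum_{k} \cos(\omega_k n + \phi_{k,0}) \sum_{i} |X_i[k]| && (5.9) \\
&= \sum_{k} \cos(\omega_k n + \phi_{k,0}) \sum_{i} A_{k,i}[n], && (5.10)
\end{aligned}$$

where the $A_{k,i}[n]$ are frame-wise constant amplitude parameters that 
correspond directly to the Fourier coefficients: 

$$A_{k,i}[n] = \frac{2}{P} |X_i[k]| \tag{5.11}$$

for $n \in [iP, iP + P - 1]$. Amplitude interpolation can be included to smooth the 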
stepwise amplitude envelopes of the partials in the concatenated reconstruction. 
Then, the signal model is: 

$$\begin{aligned}
x[n] &= \sum_{i} \sum_{k} A_{k,i}[n] \cos(\omega_k n + \phi_{k,0}) && (5.12) \\
&= \sum_{k} \cos(\omega_k n + \phi_{k,0}) \sum_{i} A_{k,i}[n], && (5.13)
\end{aligned}$$

which is simply a sum of partials with constant frequencies $\omega_k$, each modulated 
by a piecewise-linear amplitude envelope given by 

$$A_{k,i}[n] = \frac{2}{P} \left| \frac{n X_i[k] + (P - n) X_{i-1}[k]}{P} \right|. \tag{5.14}$$

In the first frame, where $i = 0$, perfect reconstruction can be carried out by 
defining the amplitude envelope to be a constant 

$$A_{k,0}[n] = \frac{2}{P} |X_0[k]| \tag{5.15}$$

for $n \in [0, P - 1]$. More generally, this perfect reconstruction can be carried 
out over an arbitrary number of frames at the onset to represent the transient 
accurately. Since a prototypical signal consists of an unpitched region followed 
by a pitched region, the approach is to model the entire unpitched region per- 
fectly in the above fashion; once the pitched region is entered, the phase is fixed 
and the harmonic sum-of-partials model of Equation (5.12) is used. 
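
A minimal sketch of the resulting synthesis is given below. It makes several assumptions that should be read as such: the function name is hypothetical, the input is an array of stored magnitudes $|X_i[k]|$ (so the interpolation acts on magnitudes), the index $n$ in the envelope is taken relative to the start of each frame, and the first frame uses a constant envelope.

```python
import numpy as np

def zero_phase_synthesis(mags, P, phases0=None):
    """Sum of fixed-frequency partials with piecewise-linear amplitudes.

    mags[i, k] holds |X_i[k]| for frame i and harmonic k + 1 (column 0 is
    the first harmonic); phases0[k] is the start phase reused in every frame.
    """
    mags = np.asarray(mags, dtype=float)
    I, K = mags.shape
    if phases0 is None:
        phases0 = np.zeros(K)            # zero-phase variant
    n_local = np.arange(P)
    k = np.arange(1, K + 1)
    # w_k * n is periodic in P, so the local index gives the same phase as
    # the global one and every partial is continuous across frames.
    carriers = np.cos(2 * np.pi * np.outer(n_local, k) / P + phases0)
    out = np.zeros(I * P)
    for i in range(I):
        prev = mags[i - 1] if i > 0 else mags[0]   # constant envelope in frame 0
        # Envelope ramps from the previous frame's value to the current one.
        env = (2 / P) * (np.outer(n_local, mags[i])
                         + np.outer(P - n_local, prev)) / P
        out[i * P:(i + 1) * P] = np.sum(env * carriers, axis=1)
    return out
```

No phase interpolation appears anywhere in the loop; only the amplitude envelopes change from frame to frame, which is the computational advantage of the zero-phase model.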

Many variations of pitch-synchronous Fourier modeling can be formulated. 
For instance, the amplitude interpolation can be carried out between the centers 
of adjacent pitch period frames rather than between the frame boundaries; this 
is similar to establishing the synthesis frames in the sinusoidal model according 
to the centers of the analysis frames. Such variations will not be considered 
here; some related efforts involving zero-phase modeling, or magnitude-only 
reconstruction, have been discussed in the literature [148, 186]. The intent 
here is primarily to motivate the usefulness of parametric analysis-synthesis 
and adaptivity for signal modeling; the key points to note are that the signal 
adaptivity achieved by estimating the pitch parameter simplifies the sinusoidal 
model significantly, and that the ability to incorporate perfect reconstruction 
allows for accurate representation of transients. Also note that in either zero- 
phase or fixed-phase modeling, the elimination of the phase information results 
in immediate data reduction, and that this compression is transparent since 
it relies on the well-known principle that the ear is insensitive to the relative 
phases of component signals. 


5.3.4 Coding and Modification 


There is a substantial amount of redundancy from one pitch period to the 
next; adjacent periods of a signal have a similar structure. This self-similarity 
is clearly depicted in the pitch-synchronous representation shown in Figure 5.2 
and is of course the fundamental motivation for pitch-synchronous processing. 
Since adjacent periods are similar, the expansion coefficients of adjacent 
periods are also similar. Because of this frame-to-frame dependence, the expansion 
coefficients can be subsampled and/or coded differentially. Furthermore, mul- 
tiresolution modeling can be carried out by subsampling the tracks of the low 
frequency harmonics more than those of the high frequency ones; such subsam- 
pling reduces both the model data and the amount of computation required 
for synthesis. Indeed, the tracks can be approximated in a very rough sense; 
variations between pitch periods, which may be important for realism, can be 
reincorporated in the synthesis based on simple stochastic models. 
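
As one possible illustration of subsampling and differential coding of the harmonic tracks, the sketch below keeps every `hop`-th frame, quantizes uniformly with step size `step`, and transmits frame-to-frame differences. The function names, the uniform quantizer, and the linear interpolation in the decoder are all assumptions made here; the multiresolution variant would simply use a larger `hop` for the low-frequency harmonics.

```python
import numpy as np

def encode_tracks(mags, step=0.05, hop=2):
    """Subsample the frame-wise tracks by hop and code them differentially."""
    kept = np.asarray(mags, dtype=float)[::hop]     # keep every hop-th frame
    q = np.round(kept / step).astype(int)           # uniform quantization
    # First row is sent as is; later rows as differences from the previous row.
    return np.vstack([q[:1], np.diff(q, axis=0)])

def decode_tracks(code, n_frames, step=0.05, hop=2):
    """Undo the differences and linearly interpolate the dropped frames."""
    kept = np.cumsum(code, axis=0) * step
    t_kept = np.arange(kept.shape[0]) * hop
    t_all = np.arange(n_frames)
    return np.column_stack([np.interp(t_all, t_kept, kept[:, k])
                            for k in range(kept.shape[1])])
```

Because adjacent periods are similar, the difference rows contain small integers, which is what makes them inexpensive to code.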

The signal modifications discussed in Section 2.7 can all be carried out in 
the pitch-synchronous sinusoidal model. It is interesting to note that some 
modifications such as time-scaling and pitch-shifting can either be implemented 
based on the sinusoidal parameterization or via the granular model of the PSR 
matrix. Note that modifications which involve resampling are accelerated in the 
pitch-synchronous sinusoidal model because the Fourier series representation 
can directly be used for resampling as described in Section 5.2.2. 


5.4 PITCH-SYNCHRONOUS WAVELET TRANSFORMS 


This section considers applying the wavelet transform in a pitch-synchronous 
fashion as originally proposed in [54, 55]. The pitch-synchronous wavelet trans- 
form (PSWT) is developed as an extension of the wavelet transform that is 
suitable for pseudo-periodic signals; the underlying signal models are discussed 
for both cases. After the algorithm is introduced, implementation frameworks 
and applications are considered. 


5.4.1 Spectral Interpretations 


The wavelet transform and the pitch-synchronous wavelet transform can be 
understood most simply in terms of their frequency-domain operation. The 
spectral decompositions of each transform are described below. 


The discrete wavelet transform. As discussed in Section 3.2.1, the signal 
model underlying the discrete wavelet transform can be interpreted in two 
complementary ways. At the atomic level, the signal is represented as a sum of 
atoms of various scales; the scale is long in time at low frequencies and short 
for high frequencies. Each of these atoms corresponds to a tile in the time- 
frequency tiling given in Figure 1.9(b) in Section 1.5.2. This atomic or tile- 
based perspective corresponds to interpreting the discrete wavelet transform as 
a basis expansion; each atom or tile is a basis function. 

As an alternative to the atomic interpretation, the wavelet transform can 
be thought of as an octave-band filter bank. As reviewed in Section 3.2.1, the 
discrete wavelet transform can be implemented with a critically sampled perfect 
reconstruction filter bank with a general octave-band structure; the coefficients 
of an atomic signal expansion in a wavelet basis can be computed with such 
a filter bank. This relationship between the atomic model and the filter bank 
is evident in the tiling diagram of Figure 1.9(b); considered across frequency, 
the structure of the atomic tiles indicates an octave-band demarcation of the 
time-frequency plane. These bands in the tiling correspond to the subbands of 
the wavelet filter bank; in frequency, then, a wavelet filter bank splits a signal 
into octave bands, plus a final lowpass band. In a tree-structured iterated 
filter bank implementation, this final lowpass band corresponds to the lowpass 
branch of the final iterated stage; this branch is of particular interest for signal 
coding since it is highly downsampled. 

The filter bank interpretation indicates that the discrete wavelet transform 
provides a signal model in terms of octave bands plus a final lowpass band. 
The lowpass band is a coarse estimate of the signal. The octave bands provide 
details that can be added to successively refine the signal estimate; perfect 
reconstruction is achieved if all of the subbands are included. This lowpass- 
plus-details model is appropriate for signals which are primarily lowpass; since 
typical images are relatively lowpass signals, applications of the wavelet trans- 
form to image compression have been quite successful [205, 209]. However, 
for signals with wideband spectral content, such as high-quality audio, a low- 
pass estimate is a poor approximation. For any pseudo-periodic signals with 
high-frequency harmonic content, a lowpass estimate does not incorporate the 
high-frequency harmonics. Indeed, for general wavelet filter banks based on 
lowpass-highpass filtering at each iteration, representing a signal in terms of 
the final lowpass band simply amounts to lowpass filtering the signal and using 
a lower sampling rate, so it is not surprising that this compaction approach 
does not typically provide high-quality audio. Wavelet-based modeling of a 
bassoon signal is depicted in Figure 5.3 for the case of Daubechies wavelets 
of length eight; these wavelets will be used for all of the simulations in this 
chapter. Given the modeling inadequacy indicated in the figure, it is of interest 
to adjust the wavelet transform so that the signal estimate includes the higher 
harmonics. Upsampled wavelets provide such an adjustment. 
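
The inadequacy of the lowpass estimate for wideband harmonic signals can be reproduced with a few lines of code. The sketch below substitutes Haar filters for the length-eight Daubechies wavelets used in the simulations, purely to keep the example short, and the test signal is a made-up pair of harmonics.

```python
import numpy as np

def haar_analysis(x, depth):
    """Octave-band Haar DWT: returns [details_1, ..., details_depth, lowpass]."""
    bands, low = [], np.asarray(x, dtype=float)
    for _ in range(depth):
        even, odd = low[0::2], low[1::2]
        bands.append((even - odd) / np.sqrt(2))    # highpass branch (details)
        low = (even + odd) / np.sqrt(2)            # lowpass branch, iterated
    return bands + [low]

def haar_synthesis(bands):
    """Invert haar_analysis; pass zeroed detail bands for a coarse estimate."""
    low = bands[-1]
    for detail in reversed(bands[:-1]):
        out = np.empty(2 * len(low))
        out[0::2] = (low + detail) / np.sqrt(2)
        out[1::2] = (low - detail) / np.sqrt(2)
        low = out
    return low

# Made-up wideband harmonic signal: one low and one high harmonic.
n = np.arange(512)
x = np.cos(2 * np.pi * 4 * n / 512) + np.cos(2 * np.pi * 100 * n / 512)
bands = haar_analysis(x, 3)
# Coarse estimate: keep only the final lowpass band, zero all details.
coarse = haar_synthesis([np.zeros_like(b) for b in bands[:-1]] + [bands[-1]])
err = np.sum((x - coarse) ** 2) / np.sum(x ** 2)   # relative error energy
```

With all bands included the synthesis returns the input; with only the lowpass band, a large fraction of the signal energy (the high harmonic) ends up in the residual.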


Upsampled wavelets. In models based on spectral decompositions, com- 
paction can be achieved by isolating regions of high energy. For modeling 
pseudo-periodic signals, then, it would be useful to modify the wavelet trans- 
form in such a way that the coarse estimate includes the signal harmonics. 
Conceptually, the first step in achieving this spectral revision is to consider the 
effect of upsampling the impulse responses of the iterated filters in a wavelet 
analysis-synthesis filter bank. The spectral motivation is described after the 
following mathematical treatment. 

Figure 5.3. The discrete wavelet transform provides an octave-band decomposition of 
a signal. Compaction can be achieved by representing the signal in terms of the highly 
downsampled lowpass band; the signal estimate can be successively refined by incorporating 
the octave-band details. For a wideband pitched audio signal such as the bassoon note $x[n]$, 
the higher harmonics extend throughout the wavelet subbands as indicated in the plot of the 
spectrum. As a result, the lowpass estimate $x_{\mathrm{dwt}}[n]$ does not capture the signal behavior 
accurately. The residual $r_{\mathrm{dwt}}[n]$ is the sum of the octave-band details. 

As derived in Appendix A, a wavelet filter bank can be constructed by 
iterating critically sampled two-channel filter banks that satisfy the perfect 
reconstruction condition 

$$G_0(z)H_0(z) + G_1(z)H_1(z) = 2 \tag{5.16}$$
$$G_0(z)H_0(-z) + G_1(z)H_1(-z) = 0, \tag{5.17}$$

where the $H_i(z)$ are the analysis filters and the $G_i(z)$ are the synthesis filters. 
The perfect reconstruction condition still holds if the transformation $z \to z^M$ 
is carried out: 

$$G_0(z^M)H_0(z^M) + G_1(z^M)H_1(z^M) = 2 \tag{5.18}$$
$$G_0(z^M)H_0(-z^M) + G_1(z^M)H_1(-z^M) = 0. \tag{5.19}$$


This transformed expression, however, is not the same perfect reconstruction 
condition that arises if the constituent filters are upsampled; a comparison 
of the two expressions will lead to a simple sufficient condition for perfect 
reconstruction in an upsampled wavelet filter bank. 

Figure 5.4. The one-scale Haar basis functions shown in (a) are upsampled by two to 
derive the functions shown in (b), which clearly do not span the signal space. 
Panels: (a) one-scale Haar functions; (b) Haar functions upsampled by two. 

Given perfect reconstruction filters $\{G_0(z), G_1(z), H_0(z), H_1(z)\}$, the 
question at hand is whether the upsampled filters 

$$A_0(z) = G_0(z^M), \quad B_0(z) = H_0(z^M), \quad A_1(z) = G_1(z^M), \quad B_1(z) = H_1(z^M) \tag{5.20}$$

also provide perfect reconstruction in a two-channel filter bank. The constraint 
on the new filters is then 

$$A_0(z)B_0(z) + A_1(z)B_1(z) = 2 \tag{5.21}$$
$$A_0(z)B_0(-z) + A_1(z)B_1(-z) = 0, \tag{5.22}$$

which can be expressed in terms of the original filters as 

$$G_0(z^M)H_0(z^M) + G_1(z^M)H_1(z^M) = 2 \tag{5.23}$$
$$G_0(z^M)H_0((-1)^M z^M) + G_1(z^M)H_1((-1)^M z^M) = 0. \tag{5.24}$$


Comparing this to the expressions in Equations (5.18) and (5.19) indicates 
immediately that perfect reconstruction holds when M is odd. An odd up- 
sampling factor is thus sufficient but not necessary for perfect reconstruction, 
meaning that for some filters, an even M will work, and for others not. The 
difficulty with an even M can be exemplified using the one-scale Haar basis; 
some of the one-scale Haar basis functions are depicted in Figure 5.4(a). Up- 
sampling the underlying Haar wavelet filters by a factor of two yields the new 
expansion functions shown in Figure 5.4(b), which do not span the signal space 
since every other time point in the function set is zero. The upsampled Haar 
functions are thus not a basis, so perfect reconstruction cannot be achieved 
with filters based on Haar wavelets upsampled by even factors. 
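
The dependence on the parity of $M$ can be verified numerically for the Haar case by multiplying out the filter polynomials. The sketch below uses causal Haar filters, an illustrative choice under which the distortion term is $2z^{-d}$ (a pure delay) rather than exactly 2; all function names are hypothetical.

```python
import numpy as np

def upsample(h, M):
    """Insert M - 1 zeros between taps: H(z) -> H(z^M)."""
    out = np.zeros((len(h) - 1) * M + 1)
    out[::M] = h
    return out

def pr_terms(g0, g1, h0, h1):
    """Coefficients of the distortion and aliasing polynomials in z^-1."""
    sign = lambda h: h * (-1.0) ** np.arange(len(h))   # H(z) -> H(-z)
    distortion = np.convolve(g0, h0) + np.convolve(g1, h1)
    aliasing = np.convolve(g0, sign(h0)) + np.convolve(g1, sign(h1))
    return distortion, aliasing

def is_perfect_reconstruction(g0, g1, h0, h1, tol=1e-12):
    """True if the distortion is 2 z^-d (a pure delay) and the aliasing vanishes."""
    distortion, aliasing = pr_terms(g0, g1, h0, h1)
    one_tap = np.count_nonzero(np.abs(distortion) > tol) == 1
    return bool(one_tap and np.isclose(np.abs(distortion).max(), 2.0)
                and np.all(np.abs(aliasing) < tol))

# Causal Haar filters (an illustrative choice, so PR holds up to a delay).
s = np.sqrt(2.0)
h0, h1 = np.array([1, 1]) / s, np.array([1, -1]) / s
g0, g1 = np.array([1, 1]) / s, np.array([-1, 1]) / s
results = {M: is_perfect_reconstruction(*(upsample(f, M) for f in (g0, g1, h0, h1)))
           for M in (1, 2, 3, 4, 5)}
```

For these filters the check succeeds for the odd factors and fails for the even ones, because with even $M$ the modulation $z \to -z$ leaves the upsampled taps unchanged and the aliasing term no longer cancels.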

The formulation above showed that upsampling the impulse responses in a 
perfect reconstruction filter bank yields a new filter bank that in some cases also 
achieves perfect reconstruction. This is of interest since upsampling the filters 
adjusts the spectral decomposition derived by the filter bank. Specifically, the 
upsampling has the frequency-domain effect of compressing the spectrum by 
the upsampling factor, which admits spectral images into the range $[0, 2\pi]$; 
then, the subband of a branch in the upsampled filter bank includes both the 
original band and these images. This spectral effect is illustrated in Figure 5.5 
for a depth-three wavelet filter bank with upsampling by factors of three and 
nine. Whereas in the original wavelet transform the signal estimate is a lowpass 
version, in the upsampled transform the estimate consists of disparate frequency 
bands as indicated by the shading. The insight here is that upsampling the 
filters can be used to redistribute the subbands across the spectrum so as to 
change the frequency-domain regions that the filter bank focuses on. As will be 
seen, such redistribution can be particularly effective for spectra with strong 
harmonic behavior, i.e. pseudo-periodic signals. 

Figure 5.5. The spectral decompositions of a wavelet transform of depth three and the 
corresponding upsampled wavelet transform for an upsampling factor of three are given in the 
first two diagrams. The shaded regions correspond to the lowest branches of the transform 
filter bank trees, which provide the signal estimates. A higher degree of upsampling (9) 
yields the decomposition in the third plot. Such a decomposition is useful if the signal 
has harmonics that fall within the harmonically spaced bands indicated by the shading; this 
case arises when the upsampling factor corresponds to the pitch period. Note that only the 
positive-frequency half-band is shown in the plots. 

Several issues about the upsampled wavelet transform deserve mention. For 
one, a model in terms of the lowpass wavelet subband and a model in terms 
of the lowpass band with upsampled wavelets have the same amount of data. 
The upsampled case, however, differs from the standard case in that there is 
no meaningful tiling that can be associated with it because of the effect of the 
upsampling on the time-localization of the atoms in the decomposition. In a 
sense, the localizations in time and frequency are both banded, but this does 
not lend itself to a tile-based depiction. For this reason, the upsampled wavelet 
transform and the pitch-synchronous wavelet transform to be discussed cannot 
be readily interpreted as atomic models. The granularity of the PSWT arises 
from the pitch period segmentation and not from the filtering process. 
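
The spectral redistribution caused by upsampling can be checked directly on a lowpass filter. The sketch below uses the two-tap Haar lowpass filter and an upsampling factor of nine as illustrative choices; `magnitude_response` is a hypothetical helper that evaluates the DTFT of the filter taps.

```python
import numpy as np

def magnitude_response(h, w):
    """|H(e^{jw})| for FIR taps h evaluated at the frequencies w (rad/sample)."""
    n = np.arange(len(h))
    return np.abs(np.exp(-1j * np.outer(w, n)) @ h)

M = 9                                          # upsampling factor (illustrative)
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)       # Haar lowpass, used for brevity
h0_up = np.zeros((len(h0) - 1) * M + 1)
h0_up[::M] = h0                                # H0(z) -> H0(z^M)

harmonics = 2 * np.pi * np.arange(M) / M       # w = 2*pi*k/M
between = harmonics + np.pi / M                # midway between harmonics
gain_on = magnitude_response(h0_up, harmonics)
gain_off = magnitude_response(h0_up, between)
```

The upsampled filter has the full lowpass gain at every multiple of $2\pi/M$ and, for this filter, a null midway between them, so its passband is a comb of bands rather than a single low-frequency band.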


Pitch-period upsampling. For signals with wideband harmonic structure, 
the lowpass estimate of the wavelet signal model does not accurately repre- 
sent the signal. In the previous section, it was shown that upsampling the 
wavelet filters alters the spectral decomposition derived by the filter bank. If 
the wavelets in the filter bank are upsampled by the pitch period, the result is 
that the lowpass band is reallocated in the spectrum to the regions around the 
harmonics. The upsampled lowpass filter has a passband around each harmonic 
frequency; the subband signal of this harmonic band corresponds to a pseudo- 
periodic estimate of the signal rather than a lowpass estimate. This leads to the 
periodic-plus-details signal model of the pitch-synchronous wavelet transform. 
The spectral decomposition of the PSWT is depicted in Figure 5.5; the har- 
monic band indicated by the shaded regions provides the pseudo-periodic signal 
estimate, and the inter-harmonic bands correspond to detail signals. The esti- 
mate is a version of the signal in which local period-to-period variations have 
been removed; these variations are captured by the detail signals, which can be 
incorporated in the synthesis if needed for perceptual realism. 
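
The reallocation of the lowpass band to the harmonic regions can be checked numerically. The following is a minimal sketch, assuming a two-tap Haar lowpass filter normalized to unit DC gain and an illustrative pitch period of P = 8 samples; the helper names `upsample_filter` and `freq_response` are introduced here for illustration only.

```python
import numpy as np

def upsample_filter(h, P):
    """Insert P-1 zeros between the taps of h (upsampling by P)."""
    g = np.zeros((len(h) - 1) * P + 1)
    g[::P] = h
    return g

def freq_response(g, omega):
    """Evaluate G(e^{j omega}) = sum_n g[n] e^{-j omega n}."""
    n = np.arange(len(g))
    return np.exp(-1j * np.outer(omega, n)) @ g

P = 8                                    # assumed pitch period in samples
h = np.array([0.5, 0.5])                 # Haar lowpass with unit DC gain
g = upsample_filter(h, P)

harmonics = 2 * np.pi * np.arange(P) / P     # multiples of 2*pi/P
midpoints = harmonics + np.pi / P            # halfway between harmonics

gain_at_harmonics = np.abs(freq_response(g, harmonics))
gain_at_midpoints = np.abs(freq_response(g, midpoints))
```

Under these assumptions the upsampled filter has unit gain at every multiple of 2*pi/P and a null halfway between adjacent multiples, which is the comb-like harmonic band described above.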

An example of the PSWT signal model is given in Figure 5.6. It should be noted that 
the same amount of data is involved in the PSWT signal model of Figure 5.6 and 
the DWT signal model of Figure 5.3. The harmonic band of the PSWT simply 
captures the signal behavior more accurately than the lowpass band of the 
DWT. Implementation of the PSWT is discussed in the next section; because 
of the problem associated with upsampling by even factors, other methods of 
generating the harmonic spectral decomposition are considered. 


5.4.2 Implementation Frameworks 


The pitch-synchronous wavelet transform can be implemented in a number of 
ways. These are described below; the actual expansion functions in the various 
approaches are rigorously formalized in [54, 55]. 


Comb wavelets. Based on the discussion on the spectral effect of upsampling 
a wavelet filter bank, a direct implementation of a pitch-synchronous wavelet 
transform simply involves upsampling by the pitch period P. The correspond- 
ing spectral decomposition has bands centered at the harmonic frequencies, and 
the signal is modeled in a periodic-plus-details fashion as desired. An impor- 
tant caveat, however, is that these comb wavelets, as derived in the treatment of 
upsampled wavelets and illustrated for the Haar case, do not guarantee perfect 
reconstruction if P is even. Because of this limitation, it is necessary to con- 
sider other structures that arrive at the same general spectral decomposition. 


The multiplexed wavelet transform. The problem with the spanning 
space in the case of comb wavelets can be overcome by using the multiplexed 
Figure 5.6. The pitch-synchronous wavelet transform provides a decomposition that is 
localized around a signal’s harmonic frequencies. Compaction is achieved by representing 
the signal in terms of the narrow bands around the harmonics, which are coupled into one 
subband in the PSWT; the model can be refined by incorporating the inter-harmonic details. 
The inter-harmonic bands are not shown in the spectral plot for the sake of neatness. For a 
wideband pitched audio signal such as the bassoon note $x[n]$, the harmonic-band estimate 
$x_{\mathrm{pswt}}[n]$ captures the signal behavior more accurately than the lowpass wavelet estimate 
$x_{\mathrm{dwt}}[n]$ plotted in Figure 5.3. The residual $r_{\mathrm{pswt}}[n]$ is the sum of the inter-harmonic details, 
and is clearly of lower energy than the wavelet residual $r_{\mathrm{dwt}}[n]$ in Figure 5.3. 


wavelet transform depicted in Figure 5.7. Here, the signal is demultiplexed 
into P subsignals, each of which is processed by a wavelet transform; these P 
subsignals correspond to the columns of the PSR matrix. The lowpass estimate 
in the wavelet transform of a subsignal is then simply a lowpass version of the 
corresponding PSR column. A pseudo-periodic signal estimate can be arrived 
at by reconstructing a PSR matrix using only the lowpass signals and then con- 
catenating the rows of the matrix. The net effect is that of pitch-synchronous 
filtering: period-to-period changes are filtered out. Perfect reconstruction can 
be achieved by incorporating all of the subbands of each wavelet transform. 
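
The multiplexed structure can be sketched in a few lines. The example below is only an illustration under stated assumptions: a single-stage Haar transform stands in for the channel wavelet transforms, the pitch period P is known and constant, the number of periods is even, and the function names are hypothetical.

```python
import numpy as np

def psr_matrix(x, P):
    """Arrange x into a PSR matrix: one pitch period per row."""
    R = len(x) // P
    return x[:R * P].reshape(R, P)

def haar_columns(psr):
    """One-stage Haar transform along each column (across periods)."""
    even, odd = psr[0::2], psr[1::2]
    low = (even + odd) / np.sqrt(2)
    high = (even - odd) / np.sqrt(2)
    return low, high

def inverse_haar_columns(low, high):
    """Invert haar_columns; pass zeros for high to obtain the estimate."""
    psr = np.empty((2 * low.shape[0], low.shape[1]))
    psr[0::2] = (low + high) / np.sqrt(2)
    psr[1::2] = (low - high) / np.sqrt(2)
    return psr

P, R = 16, 8                                  # assumed period and period count
n = np.arange(P * R)
x = (1 + 0.01 * n / P) * np.sin(2 * np.pi * n / P)   # slowly growing pitched signal

low, high = haar_columns(psr_matrix(x, P))
x_exact = inverse_haar_columns(low, high).ravel()                    # all subbands
x_estimate = inverse_haar_columns(low, np.zeros_like(high)).ravel()  # lowpass only
```

With all subbands the rows concatenate back to the original signal; with the lowpass subband alone, adjacent periods are replaced by their average, so period-to-period changes are filtered out.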


Interpretation as a polyphase structure. Polyphase methods have been 
of some interest in the literature, primarily as a tool for analyzing filter banks 
[238]. Here, it is noted that the multiplexed wavelet transform described 
above can be interpreted as a polyphase transform; a block diagram is given 
in Figure 5.8. The term polyphase simply means that a signal is treated in 
terms of progressively delayed and subsampled components, i.e. the phases of 
the signal. In the pitch-synchronous case, the signal is modeled as having P 
phases corresponding to the P pitch-synchronous subsignals. 
Figure 5.7. Block diagram of the multiplexed wavelet transform. If the number of 
branches is equal to the number of samples in a pitch period, this structure implements 
a pitch-synchronous wavelet transform. 


Figure 5.8. Polyphase formulation of the pitch-synchronous wavelet transform. Perfect 
reconstruction holds for the entire system if the channel transforms provide perfect recon- 
struction. This structure is useful for approximating a signal with period P; the overall 
signal estimate is pseudo-periodic if the channel transforms provide lowpass estimates of the 
polyphase components. 


In Figure 5.8, the subsignals are processed with a general transform T. For 
the PSWT, this is a wavelet transform. If only the lowpass bands of the wavelet 
transforms are retained, the signal reconstruction is a pseudo-periodic estimate 
of the original signal. Indeed, such an estimate can be arrived at by applying 
any type of lowpass filters in the subbands; this structure is by no means 
restricted to wavelet transforms. For nonstationary or arbitrary signals it may 
even be of interest to consider more general transforms, and perhaps joint 
adaptive optimization of P and the channel transforms. 
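
The remark that any lowpass filter can serve in the polyphase channels can be illustrated as follows. This is a sketch only: the three-tap smoothing filter, the period P = 10, and the function name `polyphase_estimate` are assumptions made for the example, not choices taken from the text.

```python
import numpy as np

def polyphase_estimate(x, P, taps):
    """Lowpass-filter each of the P polyphase components, then re-interleave."""
    y = np.zeros(len(x))
    for p in range(P):
        phase = x[p::P]                       # delay by p, subsample by P
        y[p::P] = np.convolve(phase, taps, mode="same")
    return y

P = 10                                        # assumed period
n = np.arange(P * 12)
x = np.cos(2 * np.pi * n / P) + 0.3 * np.cos(2 * np.pi * 3 * n / P)
taps = np.array([0.25, 0.5, 0.25])            # any unit-DC-gain lowpass would do
y = polyphase_estimate(x, P, taps)
```

Each polyphase component `x[p::P]` is a column of the PSR matrix. For an exactly periodic input every component is constant, so the channel filtering leaves the signal unchanged away from the block edges.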


Two-dimensional wavelet transforms. By carrying out a wavelet trans- 
form on the columns of the PSR matrix, the pitch-synchronous wavelet trans- 
form takes advantage of the similarity between adjacent pitch periods in a 
pseudo-periodic signal. Typical signals, however, also exhibit some redundancy 
from sample to sample; this redundancy is not exploited in the PSWT, but 
is central to the DWT. To account for both types of redundancy, the PSR 
can be processed by a two-dimensional wavelet transform; for separable two- 
dimensional wavelets, this amounts to coupling the PSWT and the DWT. A 
similar approach has been applied successfully to electrocardiogram data com- 
pression [128]. It is an open question, however, if this method can be used for 
high-quality compression of speech or audio. 
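
A separable two-dimensional transform of the PSR matrix can be sketched as below. The single-stage Haar filters, the matrix dimensions, and the function names are assumptions chosen for brevity; the sketch only shows how the period-to-period and sample-to-sample redundancies are addressed in turn.

```python
import numpy as np

def haar_axis(a, axis):
    """One-stage orthonormal Haar split along the given axis."""
    a = np.moveaxis(a, axis, 0)
    low = (a[0::2] + a[1::2]) / np.sqrt(2)
    high = (a[0::2] - a[1::2]) / np.sqrt(2)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)

def haar2d(psr):
    """Separable 2-D Haar: across periods (axis 0), then across samples (axis 1)."""
    low_p, high_p = haar_axis(psr, 0)
    ll, lh = haar_axis(low_p, 1)
    hl, hh = haar_axis(high_p, 1)
    return ll, lh, hl, hh

P, R = 16, 8                              # assumed period and period count
n = np.arange(P * R)
x = np.sin(2 * np.pi * n / P)             # smooth, exactly periodic test signal
ll, lh, hl, hh = haar2d(x.reshape(R, P))

energy = {name: float(np.sum(b ** 2)) for name, b in
          [("ll", ll), ("lh", lh), ("hl", hl), ("hh", hh)]}
```

For this test signal nearly all of the energy falls in the `ll` band, which holds one quarter of the coefficients; the two bands derived from the period-to-period details are empty because the signal is exactly periodic.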


5.4.3 Coding and Modification 


In this section, applications of the pitch-synchronous wavelet transform for sig- 
nal coding and modification are considered. In the pitch-synchronous sinusoidal 
model, modifications are facilitated both by the granularity of the representa- 
tion and its parametric nature. The modifications based on granulation are 
still applicable here, but the pitch-synchronous wavelet transform does not 
readily support additional modifications; for instance, modification of the spec- 
tral components leads to discontinuities in the reconstruction as will be shown 
below. After a discussion of modifications, coding issues are explored. The 
model provides an accurate and compact signal estimate for pitched signals; 
furthermore, transients can also be accurately modeled since the transform is 
capable of perfect reconstruction. 


Spectral shaping. In audio processing and especially computer music, novel 
modifications are of great interest. An immediate modification suggested by 
the spectral decomposition of the pitch-synchronous wavelet transform is that 
of spectral shaping. If gains are applied to the subbands, the spectrum can 
seemingly be reshaped in various ways to achieve such modifications. How- 
ever, this approach has a subtle difficulty similar to the problem in the discrete 
wavelet transform wherein the reconstruction and aliasing cancellation con- 
straints are violated if the subbands are modified. The difficulty arises because 
of the period-to-period amplitude variations in the pseudo-periodic original sig- 
nal. The signal estimate is derived by averaging these varying pitch periods; 
by the nature of averaging, sometimes the estimate will be greater than the 
original and sometimes less. As a result, the residual, which is the sum of the 
detail signals, exhibits discontinuities at these transition points; an example 
of this is shown in Figure 5.9. These discontinuities cancel in a perfect sig- 
Figure 5.9. The pitch-synchronous wavelet transform provides a signal decomposition 
in terms of a pseudo-periodic estimate and detail signals. The model shown here results 
from a three-stage transform; the residual is the sum of the details. The discontinuity 
occurs because the estimate is subsequently greater than and less than the original signal in 
adjacent periods. 


nal reconstruction; if the subbands are modified, however, the discontinuities 
may appear in the synthesis. This lack of robustness limits the usefulness of 
spectral manipulations in the PSWT signal model. Given the reconstruction 
difficulties, modifications are essentially restricted to those that are based on 
the granularity of the pitch-synchronous representation, which were discussed 
in Section 5.2.4. 


Signal estimation. In the discrete wavelet transform, the lowpass branch 
provides a coarse estimate of the signal; discarding the other subbands yields 
a compact model since the lowpass branch is highly downsampled. This type 
of modeling has proven quite useful for image coding [205, 209]. For audio, 
however, a lowpass estimate neglects high frequency content and thus tends to 
yield a low-quality reconstruction. Building from this observation, the pitch- 
synchronous wavelet transform estimates the signal in terms of its spectral 
content around its harmonic frequencies. For a pitched signal, these are the 
most significant regions in the spectrum, so the PSWT estimate captures more 
of a pitched signal’s behavior than the DWT. Figures 5.3 and 5.6 can be com- 
pared to indicate the relative performance of the DWT and the PSWT for 
modeling or estimation of pitched signals. 
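
The comparison can also be made on a synthetic signal. The sketch below assumes a decaying two-harmonic test signal and single-stage Haar averaging in both cases, so that each estimate retains half of the coefficients; it is not a reproduction of the bassoon example in the figures.

```python
import numpy as np

def haar_lowpass_estimate(rows):
    """Replace each pair of adjacent rows by their average (Haar lowpass only)."""
    avg = 0.5 * (rows[0::2] + rows[1::2])
    return np.repeat(avg, 2, axis=0)

P, R = 16, 32                             # assumed pitch period and period count
n = np.arange(P * R)
envelope = np.exp(-n / 2000.0)            # slow decay makes the signal pseudo-periodic
x = envelope * (np.sin(2 * np.pi * n / P) + 0.5 * np.sin(2 * np.pi * 5 * n / P))

# DWT-style estimate: average adjacent samples (N/2 coefficients kept).
x_dwt = haar_lowpass_estimate(x.reshape(-1, 1)).ravel()

# PSWT-style estimate: average adjacent pitch periods (also N/2 coefficients).
x_pswt = haar_lowpass_estimate(x.reshape(R, P)).ravel()

err_dwt = np.sum((x - x_dwt) ** 2) / np.sum(x ** 2)
err_pswt = np.sum((x - x_pswt) ** 2) / np.sum(x ** 2)
```

Here the sample-domain average smears the fifth harmonic, whereas the period-domain average only removes the slow decay between neighboring periods, so `err_pswt` is far smaller than `err_dwt` for the same amount of model data.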


Coding gain. A full treatment of the multiplexed wavelet transform for sig- 
nal coding is given in [54, 56]. The fundamental reason for the coding gain is 
that the periodic-plus-details signal model is more appropriate for signals with 
pseudo-periodic behavior than standard lowpass-plus-details models. For the 
same amount of model data, the PSWT model is more accurate than the DWT 
model. In rate-distortion terminology, the PSWT model has less distortion 
than the DWT model at low rates; at high rates, where more subbands are 
included to refine the models, the DWT distortion performance is more com- 
petitive. One caveat in this comparison is that if the original signal varies in 
pitch in a perceptually meaningful way, it is necessary to store the pitch period 
values as side information so that the pitch periods can be resampled in the 
synthesis. Such coding issues are not considered in further depth. The next two 
sections deal with relevant modeling issues that arise in the pitch-synchronous 
wavelet transform; where appropriate, comments on signal coding are given. 


Stochastic models of the detail signals. Signal estimates based on the 
pitch-synchronous wavelet transform are shown in Figures 5.6 and 5.9. These 
estimates are smooth functions that capture the key musical features of the 
original signals such as the pitch and the harmonic structure. The PSWT esti- 
mate is thus analogous to the deterministic component of the sinusoidal model; 
in both cases, the decompositions directly involve the spectral peaks. Simi- 
larly, the detail signals have a correspondence to the stochastic component of 
the sinusoidal model. Given this observation, it is reasonable to consider mod- 
eling the detail signals as a noiselike residual. Such approaches are discussed 
in [54]. This analogy between the sinusoidal model and the PSWT, of course, 
is limited to pseudo-periodic signals; for signals consisting of evolving har- 
monics plus noise, the periodic-plus-details and deterministic-plus-stochastic 
(i.e. reconstruction-plus-residual) models are similar. 


Pre-echo in the reconstruction. Like the other signal models in this book, 
compact models based on the wavelet transform exhibit pre-echo distortion. Of 
course, this pre-echo is not introduced if perfect reconstruction is carried out; 
the problem arises when the signal is modeled in terms of the lowpass sub- 
band, which cannot accurately represent transient events. This distortion is 
considered here for the case of the discrete wavelet transform. Then, by con- 
sidering the pre-echo in the DWT models of the pitch-synchronous subsignals, 
it is shown that the pre-echo is more severe in the PSWT than in the DWT. 


As discussed in Section 3.2.1, the lowpass subband in the discrete wavelet 
transform is characterized by good frequency resolution and poor time reso- 
lution. The result is that transients are delocalized in signal estimates based 
on the lowpass subband. Consider the signal onset in Figure 5.10(a) and its 
lowpass wavelet model shown in Figure 5.10(b). Pre-echo is introduced in the 
lowpass model of the onset. Some of this pre-echo actually results from preci- 
sion effects in the wavelet filter specification, but the majority of it is caused 
by the poor time localization of the low-frequency subband. 


In the pitch-synchronous wavelet transform, each of the subsignals is mod- 
eled in terms of the lowpass band of a discrete wavelet transform, which means 
that each of these downsampled signals is susceptible to the pre-echo of the 
DWT model. The subsignal pre-echo occurs in the time domain at a subsam- 
pled rate, specifically a factor P less than the original signal. At the synthesis 
side, the subsignals are upsampled by the pitch period P and as a result the 
Figure 5.10. Pre-echo is introduced in the discrete wavelet transform when a transient 
signal (a) is estimated in terms of the lowpass subband (b). The pre-echo is significantly 
increased in the pitch-synchronous wavelet transform model (c) since the discrete wavelet 
transform pre-echo occurs in each of the subsignals; noting the structure of Figure 5.8, the 
pre-echo in each subsignal is upsampled by the pitch period in the reconstruction, which 
accounts for the spreading. 


pre-echo is spread out by a factor of P. This drastic increase in the pre-echo is 
illustrated in Figure 5.10(c). Another example of PSWT signal estimation and 
pre-echo is given in Figure 5.11; here, the DWT clearly provides a poor model 
when compared to a PSWT involving the same amount of model data, i.e. the 
same depth of filtering. This example is discussed further in the next section. 
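
The spreading of the pre-echo by a factor of P can be reproduced with a toy onset. In the sketch below, a symmetric three-tap smoothing filter is an assumed stand-in for the lowpass wavelet branch, and the values of P and the onset position are arbitrary.

```python
import numpy as np

P, R, onset_period = 8, 10, 5             # assumed sizes
x = np.zeros(P * R)
x[onset_period * P:] = 1.0                # abrupt onset at sample 40

taps = np.array([0.25, 0.5, 0.25])        # symmetric (noncausal) lowpass

# Direct lowpass filtering at the original rate.
y_direct = np.convolve(x, taps, mode="same")

# The same filter applied to each pitch-synchronous subsignal.
y_sync = np.zeros_like(x)
for p in range(P):
    y_sync[p::P] = np.convolve(x[p::P], taps, mode="same")

onset = onset_period * P
pre_echo_direct = onset - int(np.flatnonzero(y_direct)[0])
pre_echo_sync = onset - int(np.flatnonzero(y_sync)[0])
```

The filter reaches one sample ahead of the onset when applied directly, but one full period ahead when applied to the subsignals, since one step in a subsignal corresponds to P samples after re-interleaving.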


Perfect reconstruction of transients. In the previous section, it was shown 
that pre-echo occurs in compact PSWT models of transient signals. Indeed, 
pre-echo is a basic problem in compact models; here, the problem can be readily 
solved since the PSWT is capable of perfect reconstruction. The idea is to only 
use the compact model where appropriate and to carry out perfect reconstruc- 
tion where the compact model is inaccurate, namely near the transients. Near 
transients, there is significant energy in the detail signals of the PSWT; when 
this condition occurs, the subbands should be included in the model. In terms 
of the PSR matrix, this corresponds to representing the first few rows of the 
matrix exactly. Once the signal becomes pseudo-periodic, most of the energy 
falls in the harmonic bands and the inter-harmonic bands can be discarded 
without degrading the estimate. Thus, compaction is achieved in the pitched 
regions but not in the unpitched regions. An example of this signal-adaptive 
representation is given in Figure 5.11, which shows the pre-echo reduction that 
results from incorporating one detail signal into the reconstruction. With the 
exception of filter precision effects, perfect reconstruction is achieved if all of the 
detail signals are included; generally, all of the detail signals must be included 
to avoid introducing the aforementioned discontinuities into the reconstruction. 
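
A signal-adaptive reconstruction of this kind can be sketched as follows. The single-stage Haar transform across periods, the relative energy threshold, and the function name `adaptive_pswt_estimate` are assumptions made for the illustration; in particular, the threshold rule is a hypothetical criterion and not one specified in the text.

```python
import numpy as np

def adaptive_pswt_estimate(x, P, threshold):
    """One-stage Haar across periods; keep a detail row only if its energy is large."""
    psr = x.reshape(-1, P)
    low = (psr[0::2] + psr[1::2]) / np.sqrt(2)
    high = (psr[0::2] - psr[1::2]) / np.sqrt(2)
    # Hypothetical rule: keep details whose energy exceeds a fraction of the lowpass energy.
    keep = np.sum(high ** 2, axis=1) > threshold * np.sum(low ** 2, axis=1)
    high_kept = np.where(keep[:, None], high, 0.0)
    est = np.empty_like(psr)
    est[0::2] = (low + high_kept) / np.sqrt(2)
    est[1::2] = (low - high_kept) / np.sqrt(2)
    return est.ravel(), keep

P, R = 16, 12                             # assumed period and (even) period count
n = np.arange(P * R)
x = np.sin(2 * np.pi * n / P)
x[:P] = 0.0                               # one period of silence, then an abrupt onset
x_hat, keep = adaptive_pswt_estimate(x, P, threshold=0.01)
```

Only the pair of periods that straddles the onset retains its detail signal; the onset is then reconstructed exactly, while the sustained region is represented by the lowpass estimate alone.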


[Figure 5.11 plot. Panels: signal onset; DWT model; PSWT model; refined PSWT model. Horizontal axis: time (samples), 0 to 300.]


Figure 5.11. Pre-echo in models of a synthetic signal (a) with higher harmonics. Though 
the two models involve the same amount of data, the lowpass DWT model in (b) is clearly 
a much less accurate signal estimate than the PSWT model in (c). In the PSWT, however, 
the pre-echo is spread out by a factor equal to the pitch period. By incorporating the 
detail signals in the onset region, the pre-echo can be reduced; perfect reconstruction of the 
transient is achieved by adding all of the details in the vicinity of the onset. 


In coding applications, the additional cost of allowing for perfect reconstruc- 
tion of transients is not significant; in a musical note, for instance, the attack 
is typically much shorter than the pseudo-periodic sustain region, so perfect 
reconstruction is required only over a small percentage of the signal. Further- 
more, since the attack region is perceptually important, perfect reconstruction 
of the attack is worthwhile from the standpoint of psychoacoustics; transparent 
modeling of attacks is necessary for high-quality audio synthesis. For musical 
phrases, then, perfect reconstruction is carried out in the unpitched regions 
while harmonic PSWT modeling is carried out for the pitched regions. This 
process preserves note transitions. In a sense, it also introduces a concurrency 
in the unpitched regions similar to that of the deterministic-plus-stochastic 
model. When the signal exhibits transient behavior, a full model with concur- 
rent harmonics and inter-harmonic details is used, whereas in stable pitched 
regions, the harmonic model alone is used. 

This approach of signal-adaptive modeling and reconstruction in the PSWT 
can be interpreted as a filter bank where only subbands with significant energy 
are included in the synthesis. Similar ideas have been employed in subband 
coding algorithms based on more standard filter bank structures such as the 
discrete wavelet transform and uniform filter banks [232, 238]. 


5.5 APPLICATIONS 


Of course, pitch-synchronous methods such as the ones discussed in this chapter 
have immediate applications in audio processing. These have been considered 
throughout the chapter; a few further issues are treated in Section 5.5.1. Pitch- 
synchronous methods can also be applied to any signals with pseudo-periodic 
behavior, e.g. heartbeat signals. The advantages of such methods result from 
the effort applied to estimation of the pitch parameter and the accompanying 
ability to exploit redundancies in the signal. 


5.5.1 Audio Signals 


Application of pitch-synchronous Fourier and wavelet approaches to single-voice 
audio has been discussed throughout this chapter. These models provide com- 
pact representations that enable a wide range of modifications. In polyphonic 
audio, pitch-based methods are not as immediately applicable since a repetitive 
time-domain structure may not exist in the signal. In those cases it would be 
necessary to first carry out source separation to derive single-voice components 
with well-defined pitches; source separation is a difficult problem that has been 
addressed in the signal processing community and in the psychoacoustics 
literature on auditory scene analysis [27, 239]. Given these difficulties, the PSWT is 
in essence primarily useful for the single voice case, which is relevant to speech 
coding and music synthesis; for instance, data compression can be achieved in 
samplers by using the signal estimate provided by the PSWT. 


5.5.2 Electrocardiogram Signals 


Electrocardiogram (ECG) signals, i.e. heartbeat signals, exhibit pseudo-periodic 
behavior. Nearby pulses are very similar in shape, but various changes in the 
behavior are medically significant. It is important, then, to monitor the heart- 
beat signal and record it for future analysis, especially in ambulatory scenarios 
where a diagnostic expert may not be present. For such applications, as in 
all data storage scenarios, it is both economically and pragmatically important 
to store the data in a compressed format while preserving its salient features. 
Various methods of ambulatory ECG signal compression have been presented 
in the literature; these rely on either the redundancy between neighboring sam- 
plings of the signal or the redundancy between adjacent periods [108, 134]. A 
method exploiting both forms of redundancy is proposed in [128]; here, the 
signal is segmented into pulses and arranged into a structure resembling a 
PSR matrix. Then, this structure is interpreted as an image and compressed 
using a two-dimensional discrete cosine transform (DCT); the compression is 
structured such that important features of the pulse shape are represented 
accurately. The pitch-synchronous approaches discussed in this chapter, espe- 
cially the extension to two-dimensional wavelets, provide a similar approach; 
important features such as transients can be preserved in the representation. 
Both this DCT-based ECG compression algorithm and the PSWT itself are 
reminiscent of several other efforts involving multidimensional processing of 
one-dimensional signals, for instance image processing of audio [12, 214]. 


5.6 CONCLUSION 


For pseudo-periodic signals, the similarities between adjacent periods can be 
exploited to achieve compact signal models. This notion was the basic theme of 
this chapter, which opened with a discussion of estimation of signal periodicity 
and construction of a pitch-synchronous representation. This representation, 
which is itself useful for signal modification because of its granularity, served 
to establish a framework for pitch-synchronous processing. Specifically, it was 
shown that using a pitch-synchronous representation in conjunction with sinu- 
soidal modeling leads to a simpler analysis-synthesis and more compact models 
than in the general case. Furthermore, it was shown that the wavelet transform, 
which is intrinsically unsuitable for wideband harmonic signals, can be cast into 
a pitch-synchronous framework to yield effective models of pseudo-periodic sig- 
nals. In either case, the model improvement is a result of the adaptivity brought 
about by incorporating the pitch parameter in the signal model; furthermore, 
both approaches allow for perfect reconstruction of transients. 


6 MATCHING PURSUIT AND 
ATOMIC MODELS 


At worst, one is in motion; and at best, 
Reaching no absolute, in which to rest, 
One is always nearer by not keeping still. 


— Thom Gunn, “On the Move” 


In atomic models, a signal is represented in terms of localized time-frequency 
components. Chapter 3 discussed an interpretation of the sinusoidal model as 
an atomic decomposition in which the atoms are derived based on parameters 
extracted from the signal; this perspective clarified the resolution tradeoffs in 
the sinusoidal model and motivated multiresolution extensions. In this chapter, 
signal-adaptive parametric models based on overcomplete dictionaries of time- 
frequency atoms are considered. Such overcomplete expansions can be derived 
using the matching pursuit algorithm [139]. The resulting representations are 
signal-adaptive in that the atoms for the model are chosen to match the signal 
behavior; furthermore, the models are parametric in that the atoms can be 
described in terms of simple parameters. The pursuit algorithm is reviewed 
and variations are described; primarily, the method is formalized for the case 
of dictionaries of damped sinusoids, for which the computation can be car- 
ried out with simple recursive filter banks. Atoms based on damped sinusoids 
are shown to be more effective than symmetric Gabor atoms for representing 
transient signal behavior such as attacks in music. 


6.1 ATOMIC DECOMPOSITIONS 


Time-frequency atomic signal representations have been of ongoing interest 
since their introduction by Gabor several decades ago [74, 75]. The fundamental 
notions of atomic modeling are that a signal can be decomposed into elementary 
functions that are localized in time-frequency and that such decompositions are 
useful for applications such as signal analysis and coding. This section provides 
an overview of the computation and properties of atomic models. The overview 
is based on an interpretation of atomic modeling as a linear algebraic inverse 
problem, which is discussed below. 


6.1.1 Signal Modeling as an Inverse Problem 


As discussed in Chapter 1, a signal model of the form 


    $x[n] = \sum_{m=1}^{M} a_m d_m[n]$    (6.1) 


can be expressed in matrix notation as 
    $x = D a$ with $D = [\, d_1 \; d_2 \; \cdots \; d_m \; \cdots \; d_M \,]$,    (6.2) 


where the signal x is a column vector (N x 1), a is a column vector of expansion 
coefficients (M x 1), and D is an N x M matrix whose columns are the expansion 
functions $d_m[n]$. In this framework, derivation of the model coefficients is an 
inverse problem. 

When the functions $d_m[n]$ constitute a basis, such as in Fourier and wavelet 
decompositions, the matrix D in Equation (6.2) is square (N = M) and invertible, 
and the expansion coefficients a for a signal x are uniquely given by 


    $a = D^{-1} x$.    (6.3) 

In the framework of biorthogonal bases, there is a dual basis $\tilde{D}$ such that 

    $D^{-1} = \tilde{D}^{H}$ and $a = \tilde{D}^{H} x$,    (6.4) 


where the superscript H denotes the conjugate (i.e. Hermitian) transpose; for 
the special case of an orthogonal basis, the dual basis is simply $\tilde{D} = D$. Considering 
a single component in Equation (6.4), it can be seen that the coefficients 
in a basis expansion can each be derived independently using the formula 


    $a_m = \tilde{d}_m^{H} x = \langle \tilde{d}_m, x \rangle$.    (6.5) 


While this ease of computation is an attractive feature, basis expansions are 
not generally useful for signal modeling given the drawbacks demonstrated 
in Section 1.4.1; namely, basis expansions do not provide compact models of 
arbitrary signals. This shortcoming results from the attempt to model arbitrary 
signals in terms of a limited and fixed set of functions. 
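
Equations (6.3) through (6.5) can be verified numerically. The sketch below assumes a randomly generated (hence almost surely invertible) complex basis of size N = 6; the variable names are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
D = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))  # generic basis
x = rng.standard_normal(N)

D_dual = np.linalg.inv(D).conj().T        # dual basis: D^{-1} = D_dual^H
a = D_dual.conj().T @ x                   # all coefficients at once, as in (6.4)

# Each coefficient separately, as an inner product with one dual vector (6.5).
# np.vdot conjugates its first argument, giving d_dual_m^H x.
a_single = np.array([np.vdot(D_dual[:, m], x) for m in range(N)])

# Orthogonal special case: an orthonormal basis is its own dual.
Q, _ = np.linalg.qr(D)
a_orth = Q.conj().T @ x
```

In each case the expansion reproduces the signal exactly, and every coefficient is obtained independently of the others.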

To overcome the difficulties of basis expansions, signals can instead be mod- 
eled using overcomplete sets of atoms that exhibit a wide range of time-frequency 
behaviors [2, 31, 32, 139, 180]. Such overcomplete expansions allow for com- 
pact representation of arbitrary signals for the sake of compression or analysis 
[94, 139]. With respect to the interpretation of signal modeling as an inverse 
problem, when the functions $d_m[n]$ constitute an overcomplete or redundant set 
(M > N), the dictionary matrix D is of rank N and the linear system in Equa- 
tion (6.2) is underdetermined. The null space of D then has nonzero dimension 
and there are an infinite number of expansions of the form of Equation (6.1). 
Various methods of deriving overcomplete expansions are discussed in the next 
section; specifically, it is established that sparse approximate solutions of an 
inverse problem correspond to compact signal models, and that computation 
of such sparse models calls for a nonlinear approach. 


6.1.2 Computation of Overcomplete Expansions 


As described in Section 1.4.2, there are a variety of frameworks for deriving 
overcomplete signal expansions; these differ in the structure of the dictionary 
and the manner in which dictionary atoms are selected for the expansion. Ex- 
amples include best basis methods and adaptive wavelet packets, where the 
overcomplete dictionary consists of a collection of bases; a basis for a signal 
expansion is chosen from the set of bases according to a metric such as entropy 
or rate-distortion [37, 102, 191]. In this chapter, signal decomposition using 
more general overcomplete sets is considered. Such approaches can be roughly 
grouped into two categories: parallel methods such as the method of frames 
[40, 41], basis pursuit [31, 32], and FOCUSS [2, 193], in which computation of 
the various expansion components is coupled; and, sequential methods such as 
matching pursuit and its variations [2, 28, 61, 84, 90, 107, 139, 168, 180, 181], in 
which models are computed one component at a time. All of these methods can 
be interpreted as approaches to solving inverse problems. For compact signal 
modeling, sparse approximate solutions are of interest; the matching pursuit 
algorithm of [139] is particularly useful since it is amenable to the task of 
modeling arbitrary signals using parameterized time-frequency atoms in a successive 
refinement framework. After a brief review of the singular value decomposi- 
tion and the pseudo-inverse, nonlinear approaches such as matching pursuit are 
motivated. 


The SVD and the pseudo-inverse. One solution to arbitrary inverse prob- 
lems can be arrived at using the singular value decomposition of the dictionary 
matrix, from which the pseudo-inverse $D^{+}$ can be derived. The coefficient vector 
$D^{+} x$ has the minimum two-norm of all solutions [220]. This minimization 
of the two-norm is inappropriate for deriving signal models, however, in that 
it tends to spread energy throughout all of the elements of the coefficient vector. Such spreading 
undermines the goal of compaction. 
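
The dispersion of the minimum-norm solution is easy to reproduce. The sketch below assumes a small dictionary formed by concatenating the identity with a normalized Hadamard matrix; this dictionary and the choice of the two active atoms are illustrative and are not the ones used for the figures.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

N = 8
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])   # N x 2N, unit-norm columns

a_sparse = np.zeros(2 * N)
a_sparse[0] = 1.0            # one spike atom
a_sparse[N + 3] = 1.0        # one Walsh-type atom
x = D @ a_sparse

a_pinv = np.linalg.pinv(D) @ x                         # minimum two-norm solution

nonzeros_sparse = int(np.count_nonzero(a_sparse))
nonzeros_pinv = int(np.sum(np.abs(a_pinv) > 1e-10))
```

Both coefficient vectors reproduce the signal exactly. The pseudo-inverse solution has the smaller two-norm, but it spreads its energy over every atom in the dictionary, whereas the sparse expansion uses only two.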

An example of the dispersion of the SVD approach was given earlier in 
Figure 1.5. An alternative example is shown in Figure 6.1; here, the signal to 
be modeled is constructed as the sum of two functions from an overcomplete 
set, meaning that there is an expansion in that overcomplete set with only two 
nonzero coefficients. This exact sparse expansion is shown in the plot by the 
asterisks; the dispersed expansion computed using the SVD pseudo-inverse is 
indicated by the circles. The representations can be immediately compared with 
respect to two applications: first, the sparse model is clearly more appropriate 
for compression; second, it provides a more useful analysis of the signal in 
that it identifies fundamental signal features. This simulation thus provides 
Figure 6.1. Overcomplete expansions and compaction. An exact sparse expansion of a 
signal in an overcomplete set (*) and the dispersed expansion given by the SVD pseudo- 
inverse (o). 


an example of an issue discussed in Section 1.2, namely that compression and 
analysis are linked. 


Sparse approximate solutions and compact models. Given the desire 
to derive compact representations for signal analysis, coding, denoising, and 
modeling in general, the SVD is not a particularly useful tool. An SVD-based 
expansion is by nature not sparse, and thresholding small expansion coefficients 
to improve the sparsity is not a useful approach [84, 160]. A more appropri- 
ate paradigm for deriving an overcomplete expansion is to apply an algorithm 
specifically designed to arrive at sparse solutions [194]. Because of the com- 
plexity of the search, however, it is not computationally feasible to derive an 
optimal sparse expansion that perfectly models a signal. It is likewise not fea- 
sible to compute approximate sparse expansions that minimize the error for a 
given sparsity; this is an NP-hard problem [42]. For this reason, it is necessary 
to narrow the considerations to methods that either derive sparse approximate 
solutions according to suboptimal criteria or derive exact solutions that are not 
optimally sparse. The matching pursuit algorithm introduced in [139] is an ex- 
ample of the former category; it is the method of choice here since it provides a 
framework for deriving sparse approximate models with successive refinements 
and since it can be carried out with low cost as will be seen. Methods of the 
latter type tend to be computationally costly and to lack an effective successive 
refinement framework [32, 193].


6.1.3 Signal-Adaptive Parametric Models 


The set of expansion coefficients and functions in Equation (6.1) provides a 
model of the signal. If the model is compact or sparse, the decomposition 
indicates fundamental signal features and is useful for analysis and coding. 
Such compact models necessarily involve expansion functions that are highly 
correlated with the signal; this property is an indication of signal adaptivity. 
As discussed throughout this book, effective signal models can be achieved 
by using signal-adaptive expansion functions, e.g. the multiresolution sinusoidal 
partials of Chapter 3 or the pitch-synchronous grains of Chapter 5. In those
approaches, model parameters are extracted from the signal by the analysis
process and the synthesis expansion functions are constructed using these pa- 
rameters; in such methods, the parameter extraction leads to signal adaptiv- 
ity. In atomic models based on matching pursuit, signal adaptivity is instead 
achieved by choosing expansion functions from the dictionary that match the 
time-frequency behavior of the signal. When the dictionary has a parametric 
structure, i.e. when the atoms in the dictionary can be indexed by meaningful
parameters, the resultant model is both signal-adaptive and parametric. While 
this framework is fundamentally different from that of traditional parametric 
models, the signal models in the two cases have similar properties. 


6.2 MATCHING PURSUIT 


Matching pursuit is a greedy iterative algorithm for deriving signal decomposi- 
tions in terms of expansion functions chosen from a dictionary [139]. To achieve 
compact representation of arbitrary signals, it is necessary that the dictionary 
elements or atoms exhibit a wide range of time-frequency behaviors and that 
the appropriate atoms from the dictionary be chosen to decompose a particu- 
lar signal. When a well-designed overcomplete dictionary is used in matching 
pursuit, the nonlinear nature of the algorithm leads to compact signal-adaptive 
models [94, 139, 180].

A dictionary can be likened to the matrix D in Equation (6.2) by considering 
the atoms to be the matrix columns; then, matching pursuit can be interpreted 
as an approach for computing sparse approximate solutions to inverse problems 
[84, 160]. Related approximation methods have indeed been used in linear alge- 
bra for some time [81, 160]. Furthermore, matching pursuit is similar to some 
forms of vector quantization and is related to the projection pursuit method 
investigated in the field of statistics for the task of finding compact models of 
data sets [73, 80, 105]. 


6.2.1 One-Dimensional Pursuit 


The greedy iteration in the matching pursuit algorithm is carried out as follows. 
First, the atom that best approximates the signal is chosen, where the two-norm 
is used as the approximation metric because of its mathematical convenience. 
The contribution of this atom is then subtracted from the signal and the process 
is iterated on the residual. Denoting the dictionary by D, the task at the i-th 
stage of the algorithm is to find the atom $d_{m(i)}[n] \in D$ that minimizes the
two-norm of the residual

$$r_{i+1}[n] = r_i[n] - \alpha_i d_{m(i)}[n], \tag{6.6}$$

where $\alpha_i$ is a weight that describes the contribution of the atom to the signal,
i.e. the expansion coefficient, and $m(i)$ is the dictionary index of the atom
chosen at the i-th stage; the iteration begins with $r_1[n] = x[n]$. To simplify
the notation, the atom chosen at the i-th stage is hereafter referred to as $g_i[n]$,
where

$$g_i[n] = d_{m(i)}[n] \tag{6.7}$$

from Equation (6.6). The subscript $i$ refers to the iteration when $g_i[n]$ was
chosen, while $m(i)$ is the actual dictionary index of $g_i[n]$.

Treating the signals as column vectors, the optimal atom to choose at the 
i-th stage can be expressed as 


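$$g_i = \arg\min_{g_i \in D} \|r_{i+1}\|^2 = \arg\min_{g_i \in D} \|r_i - \alpha_i g_i\|^2. \tag{6.8}$$

The orthogonality principle gives the value of $\alpha_i$:

$$\langle r_{i+1}, g_i \rangle = \langle r_i - \alpha_i g_i, g_i \rangle = (r_i - \alpha_i g_i)^H g_i = 0 \tag{6.9}$$

$$\Longrightarrow \quad \alpha_i = \frac{\langle g_i, r_i \rangle}{\langle g_i, g_i \rangle} = \frac{\langle g_i, r_i \rangle}{\|g_i\|^2} = \langle g_i, r_i \rangle, \tag{6.10}$$

where the last step follows from restricting the atoms to be unit-norm. The
norm of $r_{i+1}[n]$ can then be expressed as

$$\|r_{i+1}\|^2 = \|r_i\|^2 - \frac{|\langle g_i, r_i \rangle|^2}{\|g_i\|^2} = \|r_i\|^2 - |\alpha_i|^2, \tag{6.11}$$

which is minimized by maximizing

$$|\alpha_i|^2 = |\langle g_i, r_i \rangle|^2. \tag{6.12}$$

This is simply equivalent to choosing the atom whose inner product with the
signal has the largest magnitude; thus, Equation (6.8) can be rewritten as

$$g_i = \arg\max_{g_i \in D} |\langle g_i, r_i \rangle|. \tag{6.13}$$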


An example of this optimization is illustrated geometrically in Figure 6.2. Note 
that Equation (6.11) shows that the norm of the residual decreases as the 
algorithm progresses provided that an exact model has not been reached and 
that the dictionary is complete; for an undercomplete dictionary, the residual 
may belong to a subspace that is orthogonal to all of the dictionary vectors, in 
which case the model cannot be further improved by pursuit. 

In deriving a signal decomposition, the matching pursuit is iterated until the 
residual energy is below some threshold or until some other halting criterion is 
met. After I iterations, the pursuit provides the sparse approximate model 


$$x[n] \approx \sum_{i=1}^{I} \alpha_i g_i[n] = \sum_{i=1}^{I} \alpha_i d_{m(i)}[n]. \tag{6.14}$$


According to Equation (6.11), the mean-square error of this model decreases
as the number of iterations increases [139]. This convergence implies that $I$
iterations will yield a reasonable $I$-term model; this model, however, is in
general not optimal in the mean-square sense because of the term-by-term
greediness of the algorithm. Computing the optimal $I$-term estimate using an
overcomplete dictionary requires finding the minimum projection error over all
$I$-dimensional dictionary subspaces, which is an NP-hard problem as mentioned
earlier; this complexity result is established in [42] by relating the optimal
approximation problem to the exact cover by 3-sets problem, which is known to
be NP-complete.

Figure 6.2. Matching pursuit and the orthogonality principle. The two-norm or Euclidean
length of $r_{i+1}$ is minimized by choosing $g_i$ to maximize the metric
$|\langle g_i, r_i \rangle|$ and $\alpha_i$ such that $\langle r_{i+1}, g_i \rangle = 0$.

To enable representation of a wide range of signal features, a large dictio- 
nary of time-frequency atoms is used in the matching pursuit algorithm. The 
computation of the correlations $\langle g, r_i \rangle$ for all $g \in D$ is thus costly. As derived in
[139], this computation can be substantially reduced using an update formula
based on Equation (6.6); the correlations at stage $i+1$ are given by

$$\langle g, r_{i+1} \rangle = \langle g, r_i \rangle - \alpha_i \langle g, g_i \rangle, \tag{6.15}$$

where the only new computation required for the correlation update is the
dictionary cross-correlation term $\langle g, g_i \rangle$, which can be precomputed and stored
if enough memory is available. This is discussed further in Section 6.4.3. 
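
The one-dimensional pursuit and the correlation update of Equation (6.15) can be sketched as follows. This is an illustrative implementation only, assuming that the dictionary is stored as a matrix `D` with unit-norm columns and that the full set of cross-correlations (the Gram matrix) fits in memory; the function name `matching_pursuit` and the stopping tolerance `tol` are not taken from the text.

```python
import numpy as np

def matching_pursuit(x, D, n_iter, tol=1e-10):
    """Greedy pursuit over the unit-norm columns of D (shape N x M).

    Returns the length-M coefficient vector and the final residual.
    """
    gram = D.conj().T @ D                 # cross-correlations <g, g'> (precomputed)
    corr = D.conj().T @ x                 # correlations <g, r_1> with r_1 = x
    coeffs = np.zeros(D.shape[1], dtype=corr.dtype)
    residual = x.astype(corr.dtype, copy=True)

    for _ in range(n_iter):
        m = int(np.argmax(np.abs(corr)))  # atom selection, Equation (6.13)
        a = corr[m]                       # expansion coefficient, Equation (6.10)
        if abs(a) < tol:                  # nothing left to extract
            break
        coeffs[m] += a                    # an atom may be selected more than once
        residual = residual - a * D[:, m] # residual update, Equation (6.6)
        corr = corr - a * gram[:, m]      # correlation update, Equation (6.15)
    return coeffs, residual
```

Only the first correlation requires inner products with the signal; each later stage costs one column lookup and one vector subtraction, which is the saving that the update formula provides.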


6.2.2 Subspace Pursuit 


Though searching for the optimal high-dimensional subspace is not
computationally reasonable, it is worthwhile to consider the related problem of
finding an optimal low-dimensional subspace at each pursuit iteration if the
subspaces under consideration exhibit a simplifying structure. In subspace
pursuit, the i-th iteration consists of searching for an $N \times R$ matrix $G$, whose
$R$ columns are dictionary atoms, that minimizes the two-norm of the residual
$r_{i+1} = r_i - G\alpha$, where $\alpha$ is now an $R \times 1$ vector of weights. This $R$-dimensional
formulation is similar to the one-dimensional case; the orthogonality constraint
$\langle r_i - G\alpha, G \rangle = 0$ leads to a solution for the weights:

$$\alpha = (G^H G)^{-1} G^H r_i. \tag{6.16}$$

The energy of the residual is then given by

$$\langle r_{i+1}, r_{i+1} \rangle = r_i^H r_i - r_i^H G (G^H G)^{-1} G^H r_i, \tag{6.17}$$


which is minimized by choosing G so as to maximize the second term. This 
approach is expensive unless G consists of orthogonal vectors or has some other 
special structure. 


6.2.3 Conjugate Subspaces 


One useful subspace to consider in subspace pursuit is the two-dimensional 
subspace spanned by an atom and its complex conjugate. Here, the two columns 
of $G$ are simply an atom $g$ and its conjugate $g^*$. If the signal $r_i$ is real and if $g$
has nonzero real and imaginary parts so that $G$ has full column rank and $G^H G$
is invertible, the results given above can be simplified. The metric to maximize 
through the choice of g, namely the second term in Equation (6.17), can be 
rewritten as 


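$$\frac{1}{1 - |\langle g, g^* \rangle|^2} \left( 2|\langle g, r_i \rangle|^2 - \langle g, g^* \rangle \left(\langle g, r_i \rangle^*\right)^2 - \langle g, g^* \rangle^* \langle g, r_i \rangle^2 \right) \tag{6.18}$$

and the optimal weights are

$$\alpha = \begin{bmatrix} \alpha(1) \\ \alpha(2) \end{bmatrix} = \frac{1}{1 - |\langle g, g^* \rangle|^2} \begin{bmatrix} \langle g, r_i \rangle - \langle g, g^* \rangle \langle g, r_i \rangle^* \\ \langle g, r_i \rangle^* - \langle g, g^* \rangle^* \langle g, r_i \rangle \end{bmatrix}. \tag{6.19}$$

Note that the above metric can also be written as

$$\langle g, r_i \rangle^* \alpha(1) + \langle g, r_i \rangle \alpha(1)^* = 2\,\mathrm{Re}\left\{ \langle g, r_i \rangle^* \alpha(1) \right\} \tag{6.20}$$

and that $\alpha(1) = \alpha(2)^*$, meaning that the algorithm simply searches for the
atom $g_i$ that minimizes the two-norm of the residual

$$r_{i+1}[n] = r_i[n] - \alpha_i(1) g_i[n] - \alpha_i(1)^* g_i^*[n] \tag{6.21}$$

$$\phantom{r_{i+1}[n]} = r_i[n] - 2\,\mathrm{Re}\left\{ \alpha_i(1) g_i[n] \right\}, \tag{6.22}$$

which is real-valued; the orthogonal projection of a real signal onto the subspace
spanned by a conjugate pair is again real. Using such conjugate subspaces yields
decompositions of the form

$$x[n] \approx 2 \sum_{i=1}^{I} \mathrm{Re}\left\{ \alpha_i(1) g_i[n] \right\}. \tag{6.23}$$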


This approach thus provides real decompositions of real signals using an under- 
lying complex dictionary. A similar notion based on a different computational 
framework is discussed in [139]. 
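
The closed-form weight of Equation (6.19) and the metric of Equation (6.20) can be checked against a generic least-squares projection onto the pair $[g \;\; g^*]$. The sketch below is illustrative only; the function name `conjugate_pair_projection` is an assumption, and the inner product is taken as $\langle a, b \rangle = a^H b$, which is the convention implied by Equation (6.9).

```python
import numpy as np

def conjugate_pair_projection(g, r):
    """Project a real signal r onto span{g, g*} for a complex unit-norm atom g.

    Returns alpha(1) from Equation (6.19) and the energy removed from the
    residual, i.e. the metric of Equation (6.20).
    """
    b = np.vdot(g, r)                   # <g, r> = g^H r
    c = np.vdot(g, np.conj(g))          # <g, g*> = g^H g^*
    denom = 1.0 - abs(c) ** 2           # zero when g and g* are linearly dependent
    alpha1 = (b - c * np.conj(b)) / denom
    energy = 2.0 * np.real(np.conj(b) * alpha1)
    return alpha1, energy

# Usage: one pursuit step on a real signal with a random complex atom.
rng = np.random.default_rng(2)
g = rng.standard_normal(32) + 1j * rng.standard_normal(32)
g /= np.linalg.norm(g)
r = rng.standard_normal(32)
alpha1, energy = conjugate_pair_projection(g, r)
new_residual = r - 2.0 * np.real(alpha1 * g)   # Equation (6.22), real-valued
```

In a full pursuit, `energy` would be evaluated for every candidate atom and the atom with the largest value selected.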

For dictionaries consisting of both complex and purely real (or imaginary)
atoms, the real atoms must be considered independently of the various
conjugate subspaces since the above formulation breaks down when $g$ and $g^*$ are
linearly dependent; in that case, $|\langle g, g^* \rangle| = 1$ and the matrix $G$ is singular.
It is thus necessary to compare metrics of the form given in Equations (6.18)
and (6.20) for conjugate subspaces with metrics of the form $|\langle g, r_i \rangle|^2$ for real
atoms. These metrics quantify the amount of energy removed from the residual 
in either case, and thus provide for a fair choice between conjugate subspaces 
and real atoms in the pursuit decomposition. 


6.2.4 Orthogonal Matching Pursuit 


As depicted in Figure 6.2, the matching pursuit algorithm relies on the
orthogonality principle. At stage $i$, the residual $r_i$ is projected onto the atom
$g_i$ such that the new residual $r_{i+1}$ is orthogonal to $g_i$. If the dictionary is
highly overcomplete and its elements populate the signal space densely, the 
first few atoms chosen for a decomposition tend to be orthogonal to each other, 
meaning that successive projection operations extract independent signal com- 
ponents. Later iterations, however, do not exhibit this tendency; the selected 
atoms are no longer orthogonal to previously chosen atoms and the projection 
actually reintroduces components extracted by the early atoms. This problem 
of readmission is addressed in orthogonal matching pursuit and its variations; 
the fundamental idea is to explicitly orthogonalize the functions chosen for the 
expansion. 


Backward orthogonal matching pursuit. Orthogonal matching pursuit 
is a basic variation of the matching pursuit algorithm [168]. In this method, the 
i-th stage is initiated by selecting an atom $g_i$ according to the correlation metric
as in the standard pursuit; then, rather than orthogonalizing the residual $r_{i+1}$
with respect to the single atom $g_i$, the residual is orthogonalized with respect to
the subspace spanned by the atoms chosen for the expansion up to and including
the i-th stage, i.e. the atoms $\{g_1, g_2, \ldots, g_i\}$. To achieve this orthogonalization,
however, it is necessary to modify all of the expansion coefficients at each stage. 
This issue is clarified by interpreting orthogonal matching pursuit as a subspace 
pursuit in which the space is iteratively grown. In terms of the discussion of 
subspace pursuit in Section 6.2.2, the subspace matrix is 


$$G_i = [\, g_1 \;\; g_2 \;\; \cdots \;\; g_i \,] \tag{6.24}$$

and the orthogonalization criterion is

$$\langle r_{i+1}, G_i \rangle = \langle x - G_i \alpha_i, G_i \rangle = 0. \tag{6.25}$$

This constraint can be used to derive the appropriate vector of coefficients $\alpha_i$
for the subspace projection, namely

$$\alpha_i = (G_i^H G_i)^{-1} G_i^H x, \tag{6.26}$$

which differs from Equation (6.16) in that the coefficients are derived as a
function of the original signal $x$ and not as a function of the residual $r_i$. The
correlation metric for atom selection, however, is based on the residual signal as
in one-dimensional pursuit. Note that the inverse $(G_i^H G_i)^{-1}$ can be computed
recursively using the matrix inversion lemma and the inverse $(G_{i-1}^H G_{i-1})^{-1}$
computed at the previous stage [168].

At any given stage of an orthogonal pursuit, derivation of the new set of 
expansion coefficients can be interpreted as a Gram-Schmidt orthogonalization 
carried out on the new atom chosen for the expansion. This interpretation 
can be established in an inductive manner by first assuming that the atoms 
at stage $i-1$ have been orthogonalized by a Gram-Schmidt process; in other
words, assume that the matrix $G_{i-1}$ has been converted into a matrix $\hat{G}_{i-1}$
with the same column space but with orthogonal columns. In this framework,
the signal approximation at stage $i-1$ can be expressed as

$$x \approx \hat{G}_{i-1} \alpha_{i-1}, \tag{6.27}$$

where the expansion coefficients are given by

$$\alpha_{i-1} = \hat{G}_{i-1}^H x \tag{6.28}$$

since the columns of $\hat{G}_{i-1}$ are orthogonal; note that $\hat{G}_{i-1}$ is an $N \times (i-1)$ matrix.
At stage $i$, a new unit-norm atom $g_i$ is chosen for the expansion according to
the magnitude correlation metric of Equation (6.12); then, the approximation
error is minimized by projecting the signal onto the subspace spanned by the
columns of the new matrix

$$G_i = [\, \hat{G}_{i-1} \;\; g_i \,]. \tag{6.29}$$

Using the solution for the coefficients given in Equation (6.26), the i-term
signal decomposition can be written as follows, where $c = \hat{G}_{i-1}^H g_i$ and $I_{i-1}$ is
an identity matrix of size $(i-1) \times (i-1)$:

$$G_i \alpha_i = [\, \hat{G}_{i-1} \;\; g_i \,]\, \alpha_i = [\, \hat{G}_{i-1} \;\; g_i \,] \left( G_i^H G_i \right)^{-1} G_i^H x \tag{6.30}$$

$$= [\, \hat{G}_{i-1} \;\; g_i \,] \begin{bmatrix} I_{i-1} & \hat{G}_{i-1}^H g_i \\ g_i^H \hat{G}_{i-1} & 1 \end{bmatrix}^{-1} \begin{bmatrix} \alpha_{i-1} \\ g_i^H x \end{bmatrix} \tag{6.31}$$

$$= [\, \hat{G}_{i-1} \;\; g_i \,] \begin{bmatrix} I_{i-1} + \dfrac{c c^H}{1 - c^H c} & \dfrac{-c}{1 - c^H c} \\ \dfrac{-c^H}{1 - c^H c} & \dfrac{1}{1 - c^H c} \end{bmatrix} \begin{bmatrix} \alpha_{i-1} \\ g_i^H x \end{bmatrix}. \tag{6.32}$$

The decomposition can then be simplified to

$$G_i \alpha_i = \hat{G}_{i-1} \alpha_{i-1} + \frac{\left( \hat{G}_{i-1} \hat{G}_{i-1}^H - I_N \right) g_i g_i^H \left( \hat{G}_{i-1} \hat{G}_{i-1}^H - I_N \right) x}{1 - g_i^H \hat{G}_{i-1} \hat{G}_{i-1}^H g_i} \tag{6.33}$$

$$= \hat{G}_{i-1} \alpha_{i-1} + \hat{g}_i \hat{g}_i^H x \tag{6.34}$$

$$= \sum_{j=1}^{i} \langle \hat{g}_j, x \rangle\, \hat{g}_j, \tag{6.35}$$

where the i-th orthogonal expansion vector is given by

$$\hat{g}_i = \frac{\left( \hat{G}_{i-1} \hat{G}_{i-1}^H - I_N \right) g_i}{\sqrt{1 - g_i^H \hat{G}_{i-1} \hat{G}_{i-1}^H g_i}}. \tag{6.36}$$

The vector $\hat{g}_i$ has unit norm and is orthogonal to the columns of the
matrix $\hat{G}_{i-1}$. This orthogonality, combined with the initial orthogonality of the
columns of $\hat{G}_{i-1}$, indicates that the final expression in Equation (6.34) is a
basis expansion in an i-dimensional subspace of the signal space. It is not a
basis expansion of the original signal; it is an approximation of the signal in the
subspace spanned by $\{g_1, g_2, \ldots, g_i\}$, for which an orthogonal basis has been
derived by the Gram-Schmidt method. For $i = N$, this subspace is equivalent
to the signal space and a perfect representation is achieved.

The Gram-Schmidt orthogonalization discussed above need not be explicit 
in an implementation of orthogonal matching pursuit. As mentioned earlier, 
the pursuit can be carried out with reference to the original dictionary atoms 
by updating the inverse $(G_i^H G_i)^{-1}$ using the matrix inversion lemma; this
approach preserves the parametric nature of the expansion, which would be 
compromised if the atoms were explicitly modified via the Gram-Schmidt pro- 
cess. Also, note that the algorithm corrects for readmitted components in the 
orthogonalization step. Since this corrective orthogonalization is carried out af- 
ter the atom selection, the algorithm can be referred to as a backward method; 
this designation serves to differentiate it from the forward approach discussed 
in the next section. 
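
A backward orthogonal pursuit can be sketched as follows. For brevity, this sketch recomputes the projection of Equation (6.26) at every stage with a generic least-squares solver; it does not use the recursive update based on the matrix inversion lemma described above, so it illustrates the result of each stage rather than an efficient implementation. The function name `orthogonal_matching_pursuit` is an assumption.

```python
import numpy as np

def orthogonal_matching_pursuit(x, D, n_iter):
    """Backward orthogonal pursuit over the unit-norm columns of D (N x M).

    Returns the selected dictionary indices, their coefficients, and the
    residual, which is orthogonal to every selected atom.
    """
    selected = []
    coeffs = np.zeros(0)
    residual = x.copy()
    for _ in range(n_iter):
        corr = D.conj().T @ residual          # selection uses the residual
        if selected:
            corr[selected] = 0                # chosen atoms are already orthogonal
        m = int(np.argmax(np.abs(corr)))
        selected.append(m)
        G = D[:, selected]                    # subspace matrix, Equation (6.24)
        # Coefficients from the original signal x, Equation (6.26).
        coeffs, *_ = np.linalg.lstsq(G, x, rcond=None)
        residual = x - G @ coeffs
    return selected, coeffs, residual
```

Because the coefficients refer to the original dictionary atoms, the parametric description of each selected atom is preserved.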

A number of variations of orthogonal matching pursuit can be envisioned. 
For instance, the orthogonalization need not be carried out every iteration. In 
the limit, an expansion given by a one-dimensional pursuit can be orthogonal- 
ized after its last iteration by projecting the original signal onto the subspace 
spanned by the iteratively chosen expansion functions. This projection oper- 
ation minimizes the error of the residual for approximating the signal using 
those particular expansion functions, but the approximation is of course not 
necessarily optimal. 

In the literature, speech coding using orthogonal matching pursuit has been 
discussed [196]. Furthermore, a number of refinements of the algorithm have 
been proposed [2, 61]. Such refinements basically involve different ways in which 
orthogonality is imposed or exploited; for instance, orthogonal components can 
be evaluated simultaneously as in basis expansions [61]. The following section 
discusses a method which employs the Gram-Schmidt procedure in a different 
way than the backward pursuit described above. 


Forward orthogonal matching pursuit. In orthogonal matching pursuit 
as proposed in [168], which corresponds to the backward pursuit described 
above, the atom $g_i$ is chosen irrespective of the subspace spanned by the first
$i-1$ atoms, i.e. the column space of $G_{i-1}$, and then orthogonalization is
carried out. Given a decomposition with $i-1$ atoms, however, the approximation
error of the succeeding i-term model can be decreased if the choice of atom
is conditioned on $G_{i-1}$. As will be seen, this conditioning leads to a forward
orthogonalization of the dictionary; in other words, the dictionary is
orthogonalized prior to atom selection.

Using a similar induction framework as above, where the matrix $\hat{G}_{i-1}$ of
previously chosen atoms is assumed to have orthogonal columns, the i-term
expansion can be expressed as in Equation (6.33); the difference in the forward
algorithm is that the atom $g_i$ has not yet been selected. Rather than choosing
the atom to maximize the magnitude of the correlation $\langle g, r_i \rangle$ as above, in this
approach the atom is chosen to maximize the metric

$$x^H G_i \left( G_i^H G_i \right)^{-1} G_i^H x, \tag{6.37}$$

which corresponds to the second term in Equation (6.17) from the general
development of subspace pursuit. For this specific case, where the subspace is
again iteratively grown, the metric can be expressed as

$$x^H \left[ \hat{G}_{i-1} \hat{G}_{i-1}^H + \frac{\left( \hat{G}_{i-1} \hat{G}_{i-1}^H - I_N \right) g_i g_i^H \left( \hat{G}_{i-1} \hat{G}_{i-1}^H - I_N \right)}{1 - g_i^H \hat{G}_{i-1} \hat{G}_{i-1}^H g_i} \right] x = x^H \left[ \hat{G}_{i-1} \hat{G}_{i-1}^H + \hat{g}_i \hat{g}_i^H \right] x, \tag{6.38}$$

where $\hat{g}_i$ is derived from $g_i \in D$ according to Equation (6.36). In the earlier
formulation, $g_i$ was chosen and $\hat{g}_i$ was derived from that choice so as to be
orthogonal to the columns of $\hat{G}_{i-1}$. In this approach, on the other hand, all
possible $\hat{g}_i$ are considered for the expansion, and the one which maximizes the
metric is chosen. Given some $\hat{G}_{i-1}$, the i-term approximation error resulting
from this choice of $g_i$ will always be less than or equal to the error of the i-term
approximation arrived at in the backward orthogonal matching pursuit.

Note that all of the atoms $\hat{g}_i$ are orthogonal to $\hat{G}_{i-1}$ by construction. This
observation suggests an interpretation of this variation of orthogonal matching
pursuit. Namely, the forward approach is equivalent to carrying out a
Gram-Schmidt orthogonalization on the dictionary at each stage. Once an atom is
chosen from the dictionary for the expansion, the dictionary is orthogonalized
with respect to that atom; in the next stage, correlations with the
orthogonalized dictionary, namely $\langle \hat{g}, x \rangle$, are computed to find the atom that maximizes
the metric. This orthogonalization process completely prevents readmission,
but at the cost of added computation to maintain the changing dictionary.
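
The interpretation of the forward approach as a stage-by-stage Gram-Schmidt orthogonalization of the dictionary can be sketched as follows. This is a minimal illustration, assuming the whole dictionary is held in memory and deflated explicitly at each stage; the function name `forward_orthogonal_pursuit` and the thresholds are assumptions. The orthogonalized atoms computed here differ from Equation (6.36) only by a sign, which does not affect the magnitude metric.

```python
import numpy as np

def forward_orthogonal_pursuit(x, D, n_iter, eps=1e-12):
    """Forward orthogonal pursuit over the unit-norm columns of D (N x M).

    Returns the selected dictionary indices, an orthonormal basis Q for the
    span of the selected atoms, and the residual x - Q Q^H x.
    """
    W = np.array(D, dtype=np.result_type(D, x, float))  # working copy of the dictionary
    residual = np.array(x, dtype=W.dtype)
    selected, basis = [], []
    for _ in range(n_iter):
        norms = np.linalg.norm(W, axis=0)
        valid = norms > 1e-8                   # skip atoms already in the chosen span
        metric = np.zeros(W.shape[1])
        # |<g_hat, x>| for every orthogonalized, renormalized atom.
        metric[valid] = np.abs(W[:, valid].conj().T @ x) / norms[valid]
        m = int(np.argmax(metric))
        if metric[m] <= eps:
            break
        q = W[:, m] / norms[m]                 # orthogonalized unit-norm atom
        selected.append(m)
        basis.append(q)
        residual = residual - q * np.vdot(q, residual)
        W = W - np.outer(q, q.conj() @ W)      # orthogonalize the dictionary against q
    Q = np.column_stack(basis) if basis else np.zeros((D.shape[0], 0))
    return selected, Q, residual
```

The deflation of the full dictionary in the last line of the loop is the added computation referred to above.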


Greedy algorithms and computation-rate-distortion. In matching pur- 
suit and its orthogonal variations, each iteration attempts to improve the signal 
approximation as much as possible by minimizing some error metric. In orthog- 
onal pursuits, the metric depends on previous iterations; in any case, however, 
the approximation is made without regard to future iterations. Matching pur- 
suit is thus categorized as a greedy algorithm. It is well known that such greedy 
algorithms, when applied to overcomplete dictionaries, do not lead to optimal 
approximations, i.e. optimal compact models; however, greedy approaches are
justified given the complexity of optimal approximation [42, 160]. Furthermore,
it should be noted as in Section 6.1.2 that the use of a greedy algorithm
inherently leads to successive refinement, which is a desirable property in signal
models. 


For the application of compact signal modeling, it is of interest to compare 
the approximation errors of matching pursuit and the backward and forward 
orthogonal pursuits. This comparison, however, can only be made in a definitive
sense for the case where each algorithm is initiated at stage $i$ with the same
first $i-1$ atoms; then, the energy removed from the residual in the forward
case is always greater than or equal to that in the backward approach, which is
in turn greater than or equal to that in the standard pursuit. Conditioned on
the first $i-1$ terms, the forward approach provides the optimal i-term
approximation. For the case of arbitrary i-term decompositions, however, no absolute
comparison can be made between the algorithms. While error bounds can be 
established for the various greedy approximations, the relative performance for 
a given signal cannot be guaranteed a priori since the algorithms use different 
strategies for selecting atoms [44, 45, 121]. However, useful predictive com- 
parisons of the algorithms can be carried out using ensemble results based on 
random dictionaries [2]. 


In the preceding paragraph, as in most discussions of signal modeling, com- 
parisons between the various models are phrased in terms of the amount of 
information required to describe a certain model, i.e. the compaction, and the 
approximation error of the model; this rate-distortion tradeoff is the typical 
metric by which models are compared. In implementations, however, it is also 
important to account for the resources required for model computation. In 
general, a model can achieve a better rate-distortion characteristic through 
increased computation. For example, recall from the earlier discussion that 
an approximation provided by a standard pursuit can be improved after the 
last stage by backward orthogonalization of the full expansion; this process 
results in a lower distortion at a fixed rate at the expense of the computation 
of the subspace projection. Given this example, it is reasonable to assert that 
computation considerations are important in model comparisons. Examples of 
computation-distortion tradeoffs are given for the case of orthogonal matching 
pursuit in [168]; a preliminary treatment of general computation-rate-distortion 
theory is given in [93]. 


6.3 TIME-FREQUENCY DICTIONARIES


In a compact model, the atoms in the expansion correspond to basic signal fea- 
tures. This is especially useful for analysis and coding if the atoms can be de- 
scribed by meaningful parameters such as time location, frequency modulation, 
and scale; then, the basic signal features can be identified and parameterized. 
Matching pursuit using a large dictionary of such atoms provides a compact, 
adaptive, parametric time-frequency representation of a signal [139]. Several 
types of dictionaries are discussed below.


6.3.1 Gabor Atoms


Localized time-frequency atoms were introduced by Gabor from a theoretical 
standpoint and according to psychoacoustic motivations [74, 75]. The literature 
on matching pursuit has focused on using dictionaries of Gabor atoms since 
these are generally appropriate expansion functions for time-frequency signal 
models [139].¹ In continuous time, such atoms are derived from a single
unit-norm window function $g(t)$ by scaling, modulation, and translation:

$$g_{\{s,\omega,\tau\}}(t) = \frac{1}{\sqrt{s}}\, g\!\left( \frac{t - \tau}{s} \right) e^{j\omega(t - \tau)}. \tag{6.39}$$


This definition can be extended to discrete time by a sampling argument as in
[139]; fundamentally, the extension simply indicates that Gabor atoms can be
represented in discrete time as

$$g_{\{s,\omega,\tau\}}[n] = f_s[n - \tau]\, e^{j\omega(n - \tau)}, \tag{6.40}$$

where $f_s[n]$ is a unit-norm window function supported on a scale $s$.

Note that Gabor atoms are scaled to have unit-norm and that each is indexed 
in the dictionary by a parameter set $\{s, \omega, \tau\}$. This parametric structure allows
for a simple description of a specific dictionary, which is useful for compression. 
When the atomic parameters are not tightly restricted, Gabor dictionaries are 
highly overcomplete and can include both Fourier and wavelet bases; examples 
of Gabor atoms are depicted in Figure 6.3. One issue of importance is that 
the modulation of an atom can be defined independently of the time shift, or 
dereferenced, as it will be referred to hereafter: 


$$\tilde{g}_{\{s,\omega,\tau\}}[n] = f_s[n - \tau]\, e^{j\omega n} = e^{j\omega\tau} g_{\{s,\omega,\tau\}}[n]. \tag{6.41}$$


This simple phase relationship will have an impact in later considerations; note 
that this distinction between models of time is analogous to the issue discussed 
in Section 2.2.1 in the context of the STFT time reference. 

In applications of Gabor functions, g[n] is typically an even-symmetric win- 
dow. The associated dictionaries thus consist of atoms that exhibit symmetric 
time-domain behavior. This is problematic for modeling asymmetric features 
such as transients, which occur frequently in natural signals such as music. Fig- 
ure 6.4(a) shows a typical transient from linear system theory, the damped sinu- 
soid; the first stage of a matching pursuit based on symmetric Gabor functions 
chooses the atom shown in Figure 6.4(b). This atom matches the frequency 
behavior of the signal, but its time-domain symmetry results in a pre-echo as 
indicated. The atomic model has energy before the onset of the original sig- 
nal; as a result, the residual has both a pre-echo and a discontinuity at the 
onset time as shown in Figure 6.4(c). In later stages, then, the matching
pursuit must incorporate small-scale atoms into the decomposition to remove the
pre-echo and to model the discontinuity. One approach to this problem is the
high-resolution matching pursuit algorithm proposed in [95, 107], where
symmetric atoms are still used but the selection metric is modified so that atoms
that introduce drastic artifacts are not chosen for the decomposition. Another
approach is to use a dictionary of asymmetric atoms, e.g. damped sinusoids.

¹ Atoms corresponding to wavelet and cosine packets have also been considered [32, 168].

Figure 6.3. Symmetric Gabor atoms. Such time-frequency dictionary elements are derived
from a symmetric window by scaling, modulation, and translation operations as described
in Equation (6.39).

Figure 6.4. A pre-echo is introduced in atomic models of transient signals if the atoms are
symmetric. The plots show (a) a damped sinusoidal signal, (b) the first atom chosen from
a symmetric Gabor dictionary by matching pursuit, and (c) the residual. Note the pre-echo
in the atomic model and the artifact in the residual at the onset time.


6.3.2 Damped Sinusoids 


The common occurrence of damped oscillations in natural signals justifies
considering damped sinusoids as signal model components; also, damped sinusoids
are physically better suited than symmetric Gabor atoms for representing
transients. Like the atoms in a general Gabor dictionary, damped sinusoidal atoms
can be indexed by characteristic parameters, namely the damping factor $a$,
modulation frequency $\omega$, and start time $\tau$:

$$g_{\{a,\omega,\tau\}}[n] = S_a\, a^{(n - \tau)} e^{j\omega(n - \tau)} u[n - \tau], \tag{6.42}$$

or, if the modulation is dereferenced,

$$\tilde{g}_{\{a,\omega,\tau\}}[n] = S_a\, a^{(n - \tau)} e^{j\omega n} u[n - \tau], \tag{6.43}$$

where the factor $S_a$ is included for unit-norm scaling. Examples are depicted
in Figure 6.5. It should be noted that these atoms can be interpreted as Gabor
functions derived from a one-sided exponential window; their asymmetry
distinguishes them from typical Gabor atoms, however. Also, their atomic structure
is more readily indicated by a damping factor than a scale parameter, so the
dictionary index set $\{a, \omega, \tau\}$ is used instead of the general Gabor set $\{s, \omega, \tau\}$.

Figure 6.5. Damped sinusoids: Gabor atoms based on a one-sided exponential window.

A damped sinusoidal atom corresponds to the impulse response of a filter 
with a single complex pole; this is a suitable property given the intent of rep- 
resenting transient signals, especially if the source of the signal can be well 
modeled by simple linear systems. For the sake of realizability, however, it 
is necessary to deviate somewhat from this relationship between the atoms 
and infinite impulse response filters. Specifically, a damped sinusoidal atom is 
truncated to a finite duration when its amplitude falls below a threshold $T$;
the resulting length is $L = \lceil \log T / \log a \rceil$, and the appropriate scaling factor is
then $S_a = \sqrt{(1 - a^2)/(1 - a^{2L})}$. Note that these truncated atoms have sensible
localization properties; heavily damped atoms are short-lived, and lightly
damped atoms persist in time. 
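
The construction of a truncated damped sinusoidal atom can be sketched as follows. This is an illustration under stated assumptions: the function name `damped_sinusoid_atom`, the default threshold `T=1e-3`, and the finite frame length `N` are not taken from the text, and the atom has unit norm only when its support fits inside the frame.

```python
import numpy as np

def damped_sinusoid_atom(a, omega, tau, N, T=1e-3, dereferenced=False):
    """Truncated damped sinusoid with damping a (0 < a < 1), frequency omega,
    and start time tau, on a frame of N samples."""
    L = int(np.ceil(np.log(T) / np.log(a)))          # truncation length
    S = np.sqrt((1 - a**2) / (1 - a**(2 * L)))       # unit-norm scaling factor S_a
    n = np.arange(N)
    k = n - tau                                      # time relative to the onset
    support = (k >= 0) & (k < L)                     # u[n - tau] with truncation
    phase = omega * (n if dereferenced else k)       # Equation (6.43) or (6.42)
    atom = np.zeros(N, dtype=complex)
    atom[support] = S * a ** k[support] * np.exp(1j * phase[support])
    return atom

# Usage: a = 0.9 decays below T = 1e-3 after L = 66 samples.
g = damped_sinusoid_atom(0.9, 0.3, tau=10, N=128)
g_deref = damped_sinusoid_atom(0.9, 0.3, tau=10, N=128, dereferenced=True)
```

The two versions differ only by the constant phase factor $e^{j\omega\tau}$, so they yield correlation magnitudes that are identical.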

Several approaches in the literature have dealt with time-frequency atoms 
having exponential behavior. In [72], damped sinusoids are used to provide a 
time-frequency representation in which transients are identifiable. In the appli- 
cation outlined in [72], some prior knowledge of the damping factor is assumed, 
which is reasonable for detection applications but inappropriate for deriving 
decompositions of arbitrary signals; extensions of the algorithm, however, may 
prove useful for signal modeling. In [103], wavelets based on recursive filter 
banks are derived; these provide orthogonal expansions with respect to basis 
functions having infinite time support. This treatment focuses on the more 
general scenario of overcomplete expansions based on recursive filters; unlike 
in the basis case, the constituent atoms have a flexible parametric structure.


6.3.3 Composite Atoms


The simple example of Figure 6.4 shows that symmetric atoms are inappro- 
priate for modeling some signals. While the Figure 6.4 example is motivated 
by physical considerations, i.e. simple linear models of physical systems, it cer- 
tainly does not encompass the wide range of complicated behaviors observed in 
natural signals. It is of course trivial to construct examples for which asymmet- 
ric atoms would prove similarly ineffective. Thus, given the task of modeling 
arbitrary signals, it can be argued that a wide range of both symmetric and 
asymmetric atoms should be present in the dictionary. Such composite dictio- 
naries are considered here. 

One approach to generating a composite dictionary is to simply merge a 
dictionary of symmetric atoms with a dictionary of damped sinusoids. The 
pursuit described in Section 6.2 can be carried out using such a dictionary, 
but the atomic index set requires an additional parameter to specify which 
type of atom the set refers to. Also, the nonuniformity of the dictionary in- 
troduces some difficulties in the computation and storage of the dictionary 
cross-correlations needed for the correlation update of Equation (6.15). Such 
computation issues will be discussed in Section 6.4.3. 

It is shown in Section 6.4.1 that correlations with damped sinusoidal atoms 
can be computed with low cost without using the update formula of Equation 
(6.15). The approach applies both to causal and anticausal damped sinusoids, 
which motivates considering two-sided atoms constructed by coupling causal 
and anticausal components. This construction can be used to generate sym- 
metric and asymmetric atoms; furthermore, these atoms can be smoothed by 
simple convolution operations. Such atoms take the form 


\[ g_{\{a,b,J,\omega,\tau\}}[n] = f_{\{a,b,J\}}[n - \tau]\, e^{j\omega(n-\tau)}, \tag{6.44} \]
or, if the modulation is dereferenced,
\[ \tilde{g}_{\{a,b,J,\omega,\tau\}}[n] = f_{\{a,b,J\}}[n - \tau]\, e^{j\omega n}, \tag{6.45} \]

where the amplitude envelope is a unit-norm function constructed using a causal
and an anticausal exponential according to the formula
\[ f_{\{a,b,J\}}[n] = S_{\{a,b,J\}} \left( a^{n} u[n] + b^{-n} u[-n] - \delta[n] \right) * h_J[n], \tag{6.46} \]

where $\delta[n]$ is subtracted because the causal and anticausal components, as
written, overlap at $n = 0$. The function $h_J[n]$ is a smoothing window of length
$J$; later considerations will be limited to the case of a rectangular window. A
variety of composite atoms are depicted in Figure 6.6.
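
As a rough sketch of Equation (6.46) with a rectangular smoothing window, the envelope can be built on a finite grid and normalized numerically. The function name `composite_envelope` and the grid parameter `half_width` are assumptions introduced for illustration; the grid only needs to be wide enough for the exponentials to decay.

```python
import math

def composite_envelope(a, b, J, half_width=400):
    """Unit-norm envelope of Eq. (6.46) with a rectangular h_J, on a finite grid.

    Returns (n0, f), where f[i] is the envelope at time index n0 + i.
    """
    # Two-sided exponential: a**n for n >= 0 and b**(-n) for n < 0.
    # Written this way the n = 0 sample is counted once, which is the
    # role of the subtracted unit impulse in Eq. (6.46).
    def c(n):
        return a**n if n >= 0 else b**(-n)

    n0 = -half_width
    length = 2 * half_width + J
    # Convolution with a length-J rectangle is a sum of J delayed copies.
    raw = [sum(c(n0 + i - d) for d in range(J)) for i in range(length)]
    norm = math.sqrt(sum(v * v for v in raw))
    return n0, [v / norm for v in raw]
```

Choosing `a == b` gives a symmetric envelope, while unequal damping factors give an asymmetric one; increasing `J` flattens the peak.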

The unit-norm scaling factor $S_{\{a,b,J\}}$ for a composite atom is given by

\[ S_{\{a,b,J\}} = \frac{1}{\sqrt{\Psi(a,b,J)}}, \tag{6.47} \]

where $\Psi(a,b,J)$ denotes the squared norm of the atom prior to scaling:

\[ \Psi(a,b,J) = \sum_{n} \left| \left( a^{n} u[n] + b^{-n} u[-n] - \delta[n] \right) * h_J[n] \right|^2, \tag{6.48} \]

Figure 6.6. Composite atoms: Symmetric and asymmetric atoms constructed by coupling
causal and anticausal damped sinusoids and using low-order smoothing. 


which can be simplified to 


\[ \Psi(a,b,J) = \sum_{l=0}^{J-1} \sum_{k=0}^{J-1} \left[ \frac{a^{|l-k|}}{1 - a^2} + \frac{b^{|l-k|}}{1 - b^2} + \frac{a^{|l-k|} b - a\, b^{|l-k|}}{a - b} \right], \tag{6.49} \]


which does not take truncation of the atoms into account. This approximation 
does not introduce significant errors if a small truncation threshold is used; 
however, if some error is introduced, the analysis-by-synthesis iterations in the 
pursuit work to remove the error at later stages. For the case of symmetric 
atoms ($a = b$), the squared norm $\Psi(a,a,J)$ can be written in closed form as

\[ \Psi(a,a,J) = \frac{J(1 - a^4) + 2aJ(1 - a^2)(1 + a^J) - 4a(a^2 + a + 1)(1 - a^J)}{(1+a)(1-a)^3}, \tag{6.50} \]


where a rectangular smoothing window has been assumed in the derivation. 
This scale factor affects the computational cost of the algorithm, but primarily 
with respect to precomputation. 
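
The three expressions for the squared norm can be cross-checked numerically. The sketch below evaluates the defining sum of Equation (6.48) on a wide finite grid, the double sum of Equation (6.49), and the symmetric closed form of Equation (6.50), all for a rectangular window and untruncated exponentials. The function names and the grid parameter `half_width` are assumptions made for this illustration.

```python
def psi_direct(a, b, J, half_width=2000):
    """Eq. (6.48): energy of the unscaled, rectangularly smoothed envelope."""
    def c(n):
        return a**n if n >= 0 else b**(-n)
    total = 0.0
    for n in range(-half_width, half_width + J):
        e = sum(c(n - d) for d in range(J))
        total += e * e
    return total

def psi_double_sum(a, b, J):
    """Eq. (6.49); requires a != b because of the (a - b) denominator."""
    total = 0.0
    for l in range(J):
        for k in range(J):
            m = abs(l - k)
            total += (a**m / (1 - a**2) + b**m / (1 - b**2)
                      + (a**m * b - a * b**m) / (a - b))
    return total

def psi_symmetric(a, J):
    """Eq. (6.50): closed form for the symmetric case a == b."""
    num = (J * (1 - a**4) + 2 * a * J * (1 - a**2) * (1 + a**J)
           - 4 * a * (a**2 + a + 1) * (1 - a**J))
    return num / ((1 + a) * (1 - a)**3)
```

Since the scale factor depends only on $(a, b, J)$, values such as these would typically be tabulated once for the dictionary rather than recomputed during the pursuit.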

The composite atoms described above can be written in terms of unit-norm 
constituent atoms: 


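\begin{align}
\tilde{g}_{\{a,b,J,\omega,\tau\}}[n]
 &= S_{\{a,b,J\}} \left( \frac{\tilde{g}^{+}_{\{a,\omega,\tau\}}[n]}{S_a} + \frac{\tilde{g}^{-}_{\{b,\omega,\tau\}}[n]}{S_b} - \delta[n - \tau] \right) * h_J[n] \tag{6.51} \\
 &= S_{\{a,b,J\}} \sum_{\Delta=0}^{J-1} \left( \frac{\tilde{g}^{+}_{\{a,\omega,\tau+\Delta\}}[n]}{S_a} + \frac{\tilde{g}^{-}_{\{b,\omega,\tau+\Delta\}}[n]}{S_b} - \delta[n - \tau - \Delta] \right), \tag{6.52}
\end{align}

where $\tilde{g}^{+}_{\{a,\omega,\tau\}}[n]$ is a causal atom and $\tilde{g}^{-}_{\{b,\omega,\tau\}}[n]$ is an anticausal atom defined
as

\[ \tilde{g}^{-}_{\{b,\omega,\tau\}}[n] = S_b\, b^{-(n-\tau)} e^{j\omega n}\, u[-(n - \tau)]. \tag{6.53} \]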


Note that atoms with dereferenced modulation are used in the construction 
of Equation (6.52) so that the modulations add coherently in the sum over 
the time lags $\Delta$; in the other case, the constituent atoms must be summed
with phase shifts $e^{j\omega\Delta}$ to achieve coherent modulation of the composite atom.
As will be seen in Section 6.4.2, this atomic construction leads to a simple 
relationship between the correlations of the signal with the composite atom 
and with the underlying damped sinusoids, especially in the dereferenced case. 
The special case of symmetric atoms (a = b), one example of which is shown 
in Figure 6.6, suggests the use of this approach to construct atoms similar 
to symmetric Gabor atoms based on common windows. Given a unit-norm 
window function w[n], the issue is to choose a damping factor a and a smoothing 
order $J$ such that the resultant $f_{\{a,a,J\}}[n]$ accurately mimics $w[n]$. Using the
two-norm as an accuracy metric, the objective is to minimize the error

\[ \epsilon(a,J) = \left\| f_{\{a,a,J\}}[n] - w[n] \right\|^2 \tag{6.54} \]

by optimizing $a$ and $J$. Since $f_{\{a,a,J\}}[n]$ and $w[n]$ are both unit-norm, this
expression can be simplified to

\[ \epsilon(a,J) = 2 \left( 1 - \sum_{n} f_{\{a,a,J\}}[n]\, w[n] \right). \tag{6.55} \]

Not surprisingly, the overall objective of the optimization is thus to maximize
the correlation of $f_{\{a,a,J\}}[n]$ and $w[n]$,

\[ \xi(a,J) = \sum_{n} f_{\{a,a,J\}}[n]\, w[n]. \tag{6.56} \]


In an implementation, this is not an on-line operation but rather a precomputa- 
tion indicating values of a and J to be used in the parameter set of the compos- 
ite dictionary. Interestingly, this precomputation itself resembles a matching 
pursuit. Note that the values of $a$ and $J$ for the functions $f_{\{a,a,J\}}[n]$ in the composite
dictionary are based on the scales of symmetric behavior to be included
in the dictionary. Presumably, closed form solutions for a and J can be found 
for some particular windows; such solutions are of course limited by the require- 
ment that J be an integer. The intent of this treatment, however, is not to 
investigate the computational issue of window matching per se, but instead to 
provide an existence proof that symmetric atoms constructed from one-sided 
exponentials by simple operations can reasonably mimic Gabor atoms based 
on standard symmetric windows. Figure 6.7 shows an example of a composite 
atom that roughly matches a Hanning window and a Gaussian window. 
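
The window-matching precomputation can be sketched as a plain grid search over the damping factor and the smoothing order, maximizing the correlation of Equation (6.56). This is only an illustration under stated assumptions: the Hann-type window formula, the candidate grids, the padding `pad`, and the restriction to orders for which the atom and window centers coincide on the integer grid are all choices made here, not taken from the text.

```python
import math

def hann_window(N):
    """Unit-norm Hann-type window, symmetric about (N - 1) / 2."""
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / N) for n in range(N)]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

def symmetric_atom(a, J, N, pad):
    """Unit-norm f_{a,a,J} on the grid -pad .. N-1+pad, centred on the window."""
    shift = (N - J) // 2            # requires N - J even so the centres coincide
    raw = [sum(a ** abs(n - shift - d) for d in range(J))
           for n in range(-pad, N + pad)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

def match_window(w, dampings, orders, pad=100):
    """Grid search maximizing the correlation of Eq. (6.56)."""
    N = len(w)
    best = None
    for J in orders:
        if (N - J) % 2:             # skip orders needing a half-sample shift
            continue
        for a in dampings:
            f = symmetric_atom(a, J, N, pad)
            xi = sum(f[n + pad] * w[n] for n in range(N))
            if best is None or xi > best[0]:
                best = (xi, a, J)
    return best                      # (correlation, a, J)

w = hann_window(32)
xi, a_best, J_best = match_window(w, [0.3, 0.5, 0.7, 0.8, 0.9], range(2, 25, 2))
```

Because both sequences are unit-norm on the grid, the squared error of Equation (6.54) for the selected pair equals `2 * (1 - xi)`, in agreement with Equation (6.55).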

It has been shown that a composite dictionary containing a wide range of 
symmetric and asymmetric atoms can be constructed by coupling causal and 
anticausal damped sinusoids. Atoms resembling common symmetric Gabor 
atoms can readily be generated, so standard symmetric atoms can be included 
as a dictionary subset; there is thus no generality lost by constructing atoms 
in this fashion. Furthermore, the construction is useful in that the pursuit 
computations can be carried out efficiently; the computational framework is 
developed in Sections 6.4.1 and 6.4.2. 

Figure 6.7. Symmetric composite atoms: An example of a smoothed composite atom
(solid) that roughly matches a Hanning window (dashed) and a Gaussian window (dotted). 


6.3.4 Signal Modeling 


In atomic modeling by matching pursuit, the characteristics of the signal es- 
timate depend on the structure of the time-frequency dictionary used in the 
pursuit. Consider the successively refined model in Figure 6.8, which is derived 
by pursuit with a dictionary of symmetric Gabor atoms. In its early itera- 
tions, the pursuit yields smooth estimates of the global signal behavior because 
it chooses large-scale atoms that are inherently smooth. At later stages, the 
algorithm chooses atoms of smaller scale to refine the estimate; for instance, 
small-scale atoms are incorporated to remove pre-echo artifacts. 

In the example of Figure 6.9, the model is derived by matching pursuit with 
a dictionary of damped sinusoids. Here, the early estimates have sharp edges 
since the dictionary elements are one-sided functions. In later stages, edges 
that require smoothing are refined by inclusion of overlapping atoms in the 
model; also, as in the symmetric atom case, atoms of small scale are chosen in 
late stages to counteract any inaccuracies brought about by the early atoms. 

In the examples of Figures 6.8 and 6.9, the dictionaries are designed for 
a fair comparison. Specifically, the dictionary atoms have comparable scales, 
and the dictionaries are structured such that the mean-square errors of the 
respective atomic models have similar convergence properties. A comparison 
of the convergence behaviors is given in Figure 6.10(a); the plot in Figure 
6.10(b) shows the energy of the pre-echo in the symmetric Gabor model and 
indicates that the pursuit devotes atoms at later stages to remove the pre-echo 
artifact. The model based on damped sinusoids does not introduce a pre-echo. 

Modeling with a composite dictionary is depicted in Figure 6.11. The dic- 
tionary used here contains the same causal damped sinusoids as in the example 
of Figure 6.9, plus an equal number of anticausal damped sinusoids and a few 
smoothing orders. As will be seen, calculating the correlations with the under- 
lying damped sinusoids is the main factor in the cost of the composite pursuit, 
so deriving the composite model in Figure 6.11 requires roughly twice as much 
computation as the pursuit based on damped sinusoids alone. As shown in 
Figure 6.12, this additional computation leads to a lower mean-square error for 
the model. Since the parameter set for composite atoms is larger than that for 
damped sinusoids or Gabor atoms, however, a full comparison of the various 
models requires computation-rate-distortion considerations. 

Figure 6.8. Signal modeling with symmetric Gabor atoms. The original signal in (a),
which is the onset of a gong strike, is modeled by matching pursuit with a dictionary of 
symmetric Gabor atoms derived from a Hanning prototype. Approximate models at various 
pursuit stages are given: (b) 5 atoms, (c) 10 atoms, (d) 20 atoms, and (e) 40 atoms. Note 
the pre-echo in the reconstructions. 


6.4 COMPUTATION USING RECURSIVE FILTER BANKS 


For arbitrary dictionaries, the cost of the matching pursuit iteration can be 
reduced using the correlation update relationship in Equation (6.15). For dic- 
tionaries consisting of damped sinusoids or composite atoms constructed as 
described in Section 6.3.3, the correlation computation for the pursuit can be 
carried out with simple recursive filter banks. This framework is developed in 
the following two sections; in Section 6.4.3, the computation requirements of 
the filter bank approach and the correlation update method are compared. 


6.4.1 Pursuit of Damped Sinusoidal Atoms 


For dictionaries of complex damped sinusoids, the atomic structure can be 
exploited to simplify the correlation computation irrespective of the update 
formula in Equation (6.15). It is shown here that the computation over the 
time and frequency parameters can be carried out with simple recursive filter 
banks and FFTs. 

Figure 6.9. Signal modeling with damped sinusoidal atoms. The signal in (a), which is
the onset of a gong strike, is modeled by matching pursuit with a dictionary of damped 
sinusoids. Approximate models at various pursuit stages are given: (b) 5 atoms, (c) 10 
atoms, (d) 20 atoms, and (e) 40 atoms. 

Figure 6.10. Mean-square convergence of atomic models. Plot (a) shows the mean-square
error of the atomic models depicted in Figures 6.8 and 6.9. The dictionaries of symmetric 
Gabor atoms (solid) and damped sinusoids (circles) are designed to have similar mean-square 
convergence for the signal in question. Plot (b) shows the mean-square energy in the pre- 
echo of the symmetric Gabor model; the pursuit devotes atoms at later stages to reduce the 
pre-echo energy. The damped sinusoidal decomposition does not introduce a pre-echo. 

Figure 6.11. Signal modeling with composite atoms. The signal in (a), which is the onset
of a gong strike, is modeled by matching pursuit with a dictionary of composite atoms. 
Approximate models at various pursuit stages are given: (b) 5 atoms, (c) 10 atoms, (d) 
20 atoms, and (e) 40 atoms. The composite dictionary contains the same causal damped 
sinusoids as those used in the example of Figure 6.9, plus an equal number of anticausal 
damped sinusoids and a small number of smoothing orders. 

Figure 6.12. The mean-square error of an atomic model using composite atoms (solid)
and the mean-square error of a model based on only the underlying causal damped sinusoids 
(circles). This plot corresponds to the composite atomic models given in Figure 6.11 and the 
damped sinusoidal decompositions of Figure 6.9. As described in the text, a full comparison 
of the models requires a consideration of the interplay of rate, distortion, and computation. 

Correlation with complex damped sinusoids. In matching pursuit using
a dictionary of complex damped sinusoids, correlations must be computed for
every combination of damping factor, modulation frequency, and time shift.
The correlation of a signal $x[n]$ with a truncated causal atom $g^{+}_{\{a,\omega,\tau\}}[n]$ is
given by

\[ \eta_{+}(a,\omega,\tau) = S_a \sum_{n=\tau}^{\tau+L-1} x[n]\, a^{(n-\tau)} e^{-j\omega(n-\tau)}. \tag{6.57} \]

In the following, correlations with unnormalized atoms will be used:

\begin{align}
\rho_{+}(a,\omega,\tau) &= \sum_{n=\tau}^{\tau+L-1} x[n]\, a^{(n-\tau)} e^{-j\omega(n-\tau)} \tag{6.58} \\
 &= \frac{\eta_{+}(a,\omega,\tau)}{S_a}. \tag{6.59}
\end{align}


Formulating the correlations in terms of unnormalized atoms will serve to sim- 
plify the notation and to reduce the cost of the composite pursuit algorithm 
developed in Section 6.4.2. 

The structure of the correlations in Equations (6.57) and (6.58) allows for 
a substantial reduction of the computation requirements with respect to the 
time shift and modulation parameters. These are discussed in turn below. 
Note that the correlation uses the atoms defined in Equation (6.42), in which 
the modulation is phase-referenced to $\tau$; results for atoms with dereferenced
modulation are given later. 


Time-domain simplification. The exponential structure of the atoms can 
be used to reduce the cost of the correlation computation over the time index; 
correlations at neighboring times are related by a simple recursion: 


\[ \rho_{+}(a,\omega,\tau-1) = a e^{-j\omega} \rho_{+}(a,\omega,\tau) + x[\tau-1] - a^{L} e^{-j\omega L} x[\tau+L-1]. \tag{6.60} \]

This is just a one-pole filter with a correction to account for truncation. If truncation
effects are ignored, which is reasonable for small truncation thresholds,
the formula becomes

\[ \rho_{+}(a,\omega,\tau-1) = a e^{-j\omega} \rho_{+}(a,\omega,\tau) + x[\tau-1]. \tag{6.61} \]

Note that this equation is operated in reversed time to make the recursion
stable for causal damped sinusoids; the similar forward recursion is unstable
for $a < 1$. For anticausal atoms, the correlations are given by the recursion

\[ \rho_{-}(b,\omega,\tau+1) = b e^{j\omega} \rho_{-}(b,\omega,\tau) + x[\tau+1] - b^{L} e^{j\omega L} x[\tau-L+1], \tag{6.62} \]
or, if truncation is neglected,

\[ \rho_{-}(b,\omega,\tau+1) = b e^{j\omega} \rho_{-}(b,\omega,\tau) + x[\tau+1]. \tag{6.63} \]

Figure 6.13. Filter bank interpretation and dictionary structures. The atoms in a dictio-
nary of damped sinusoids correspond to the impulse responses of a bank of one-pole filters; 
for decaying causal atoms, the poles are inside the unit circle. These dictionaries can be 
structured in various ways as depicted above. The correlations in the pursuit are computed 
by the corresponding matched filters, which are time-reversed and thus have poles outside 
the unit circle. 


These recursions are operated in forward time for the sake of stability. 
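
The recursions of Equations (6.60) and (6.62) can be checked against the defining sums. The sketch below is illustrative only; the helper names are assumptions, and the signal is taken to be zero outside its recorded range so that both recursions can start from a zero state.

```python
import cmath

def sample(x, n):
    """x[n], taken as zero outside the recorded signal."""
    return x[n] if 0 <= n < len(x) else 0.0

def rho_plus_direct(x, a, w, tau, L):
    """Eq. (6.58): correlation with an unnormalized truncated causal atom."""
    return sum(sample(x, n) * a ** (n - tau) * cmath.exp(-1j * w * (n - tau))
               for n in range(tau, tau + L))

def rho_minus_direct(x, b, w, tau, L):
    """Anticausal counterpart: the atom decays backward in time from tau."""
    return sum(sample(x, n) * b ** (tau - n) * cmath.exp(-1j * w * (n - tau))
               for n in range(tau - L + 1, tau + 1))

def rho_plus_recursive(x, a, w, L):
    """Eq. (6.60), run backward in time; returns rho_+ for tau = 0..N-1."""
    N = len(x)
    rho = [0j] * (N + 1)             # rho[N] = 0: atom lies beyond the signal
    pole = a * cmath.exp(-1j * w)
    tail = a ** L * cmath.exp(-1j * w * L)
    for tau in range(N, 0, -1):
        rho[tau - 1] = pole * rho[tau] + x[tau - 1] - tail * sample(x, tau + L - 1)
    return rho[:N]

def rho_minus_recursive(x, b, w, L):
    """Eq. (6.62), run forward in time; returns rho_- for tau = 0..N-1."""
    pole = b * cmath.exp(1j * w)
    tail = b ** L * cmath.exp(1j * w * L)
    out, state = [], 0j              # state starts as rho_-(tau = -1) = 0
    for tau in range(-1, len(x) - 1):
        state = pole * state + x[tau + 1] - tail * sample(x, tau - L + 1)
        out.append(state)
    return out
```

Each output sample costs one multiply by the pole plus the truncation correction, regardless of the atom length $L$.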

The equivalence of Equations (6.61) and (6.63) to filtering operations sug- 
gests interpreting the correlation computation over all possible parameters 
$\{a_i, \omega_i, \tau_i\}$ as an application of the signal to a dense grid of one-pole filters;
these are the matched filters for the dictionary atoms. The filter outputs are 
the correlations needed for the matching pursuit; the maximally correlated 
atom is directly indicated by the maximum magnitude output of the filter 
bank. Of course, pursuit based on arbitrary atoms can be interpreted in terms 
of matched filters, but in the general case this insight is not particularly useful; 
here, it provides a framework for reducing the computation. Note that the 
dictionary atoms themselves correspond to the impulse responses of a grid of 
one-pole filters; as in the wavelet filter bank case, then, the atomic synthesis 
can be interpreted as an application of the expansion coefficients to a synthesis 
filter bank. Fig. 6.13 depicts z-plane dictionary structures which provide for 
various tradeoffs in time-frequency resolution. 

A recursion similar to Equation (6.60) can be written for the general case of 
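correlations separated by an arbitrary lag $\Delta$:

\begin{align}
\rho_{+}(a,\omega,\tau-\Delta) = {} & a^{\Delta} e^{-j\omega\Delta} \rho_{+}(a,\omega,\tau) + \sum_{n=0}^{\Delta-1} x[n+\tau-\Delta]\, a^{n} e^{-j\omega n} \nonumber \\
 & - a^{L} e^{-j\omega L} \sum_{n=0}^{\Delta-1} x[n+\tau-\Delta+L]\, a^{n} e^{-j\omega n}. \tag{6.64}
\end{align}

For $\omega = 2\pi k/K$, the last two terms can be computed using the DFT:

\begin{align}
\rho_{+}(a,2\pi k/K,\tau-\Delta) = {} & a^{\Delta} e^{-j\omega\Delta} \rho_{+}(a,\omega,\tau) + \mathrm{DFT}_K \{ x[n+\tau-\Delta]\, a^{n} \} \big|_k \nonumber \\
 & - a^{L} e^{-j\omega L}\, \mathrm{DFT}_K \{ x[n+\tau-\Delta+L]\, a^{n} \} \big|_k, \tag{6.65}
\end{align}

where $n \in [0, \Delta-1]$ in the latter terms, which could be combined into a single
DFT. If truncation effects are ignored, the second DFT term is neglected and
the relationship is again more straightforward. Similar simplifications have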
been reported in the literature for short-time Fourier transforms using one- 
sided exponential windows [224] as well as more general cases [230]. 
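
The lag-$\Delta$ update of Equation (6.64) can be verified with a short sketch. The function names are assumptions for illustration, the two correction sums are evaluated directly rather than by a DFT, and the signal is again taken as zero outside its recorded range.

```python
import cmath

def sample(x, n):
    """x[n], taken as zero outside the recorded signal."""
    return x[n] if 0 <= n < len(x) else 0.0

def rho_plus(x, a, w, tau, L):
    """Eq. (6.58) written with the summation index starting at zero."""
    z = a * cmath.exp(-1j * w)
    return sum(sample(x, n + tau) * z ** n for n in range(L))

def rho_plus_from_lag(x, a, w, tau, L, lag, rho_tau):
    """Eq. (6.64): obtain rho_+(tau - lag) from a known rho_+(tau)."""
    z = a * cmath.exp(-1j * w)
    # Samples that enter the atom's support when it moves back by `lag`.
    entering = sum(sample(x, n + tau - lag) * z ** n for n in range(lag))
    # Samples that fall off the truncated end of the atom.
    leaving = sum(sample(x, n + tau - lag + L) * z ** n for n in range(lag))
    return z ** lag * rho_tau + entering - z ** L * leaving
```

When the correlations are only needed on a coarse time grid, this form avoids stepping the one-pole recursion through every intermediate time shift.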


Frequency-domain simplification. A simplification of the correlation com- 
putations across the frequency parameter can be achieved if the z-plane filter 
bank, or equivalently the matching pursuit dictionary, is structured such that 
the modulation frequencies are equi-spaced for each damping factor. If the 
filters (atoms) are equi-spaced angularly on circles in the z-plane, the discrete 
Fourier transform can be used for the computation over $\omega$. For $\omega = 2\pi k/K$,
the correlation is given by

\begin{align}
\rho_{+}(a,2\pi k/K,\tau) &= \sum_{n=0}^{L-1} x[n+\tau]\, a^{n} e^{-j2\pi kn/K} \tag{6.66} \\
 &= \mathrm{DFT}_K \{ x[n+\tau]\, a^{n} \} \big|_k, \tag{6.67}
\end{align}

where $n \in [0, L-1]$ and $K \geq L$. Thus, an FFT can be used to compute correlations
over the frequency index. This formulation applies to any dictionary of
harmonically modulated atoms. 
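
Equation (6.67) can be illustrated by windowing one signal segment with the exponential and taking a single $K$-point DFT. The sketch below uses a plain $O(K^2)$ DFT to stay self-contained; an FFT routine would replace it in practice. The function names are assumptions.

```python
import cmath

def dft(v, K):
    """Plain K-point DFT of v, zero-padded to length K."""
    v = list(v) + [0.0] * (K - len(v))
    return [sum(v[n] * cmath.exp(-2j * cmath.pi * k * n / K) for n in range(K))
            for k in range(K)]

def correlations_over_frequency(x, a, tau, L, K):
    """Eq. (6.67): rho_+(a, 2*pi*k/K, tau) for k = 0..K-1 from one DFT."""
    if K < L:
        raise ValueError("the DFT size K must be at least the atom length L")
    # Exponentially windowed segment starting at the atom's time origin.
    windowed = [(x[n + tau] if n + tau < len(x) else 0.0) * a ** n
                for n in range(L)]
    return dft(windowed, K)
```

One transform yields the correlations for all $K$ modulation frequencies at a fixed damping factor and time shift.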

At a fixed scale, correlations must be computed at every time-frequency 
pair in the index set. There are two ways to cover this time-frequency index 
plane; these correspond to the dual interpretations of the STFT depicted in 
Figure 2.1. The first approach is to use a running DFT with an exponential 
window; windowing and the DFT require L and K log K multiplies per time 
point, respectively, so this method requires roughly N(L + K log K) multiplies 
for a signal of length N. The second approach is to use a DFT to initialize the 
K matched filters across frequency and then compute the outputs of the filters 
to evaluate the correlations across time; indeed, the signal can be zero padded 
such that the filters are initialized with zero values and no DFT is required. 
Recalling the recursion of Equation (6.60), this latter method requires one 
complex multiply and one real-complex multiply per filter for each time point, 
so it requires $5KN$ real multiplies, $2KN$ of which account for truncation effects
and are not imperative. For large values of K, this is significantly less than the 
multiply count for the running DFT approach, so the matched filter approach 
is the method of choice. 
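
The two multiply counts can be compared with a few lines of arithmetic. In this sketch the base-2 logarithm for the FFT cost and the sample values of $N$, $L$, and $K$ are assumptions; the text only gives the counts as rough orders.

```python
import math

def running_dft_multiplies(N, L, K):
    """Roughly N * (L + K log K); the base-2 logarithm is an assumption."""
    return N * (L + K * math.log2(K))

def filter_bank_multiplies(N, K, with_truncation=True):
    """5 real multiplies per filter per time point, 2 of which handle truncation."""
    return (5 if with_truncation else 3) * K * N

# Hypothetical sizes: N = 400 time points, atom length L = 20.
small_K = (running_dft_multiplies(400, 20, 32), filter_bank_multiplies(400, 32))
large_K = (running_dft_multiplies(400, 20, 1024), filter_bank_multiplies(400, 1024))
```

Under these assumptions the two approaches are close for small $K$, and the filter bank pulls ahead as $K$ grows because its cost per time point is linear in $K$.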


Results for dereferenced modulation. The results given in the previous 
sections hold for an atom whose modulation is referenced to the time origin of 
the atom as in Equations (6.39), (6.42), and (6.44). This local time reference 
has been adhered to since it allows for an immediate filter bank interpretation 
of the matching pursuit analysis; also, synthesis based on such atoms can be 
directly carried out using recursive filters. For the construction and pursuit of 
composite atoms, however, the dereferenced atoms defined in Equations (6.41), 
(6.43), and (6.45) are of importance. The correlation formulae for dereferenced 
damped sinusoids can be derived by combining the relation in Equation (6.41) 
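with the expression in Equation (6.58) to arrive at:
\[ \tilde{\rho}_{+}(a,\omega,\tau) = e^{-j\omega\tau} \rho_{+}(a,\omega,\tau), \tag{6.68} \]
so Equations (6.61) and (6.63) can be reformulated as
\begin{align}
\tilde{\rho}_{+}(a,\omega,\tau-1) &= a\, \tilde{\rho}_{+}(a,\omega,\tau) + e^{-j\omega(\tau-1)} x[\tau-1] \tag{6.69} \\
\tilde{\rho}_{-}(b,\omega,\tau+1) &= b\, \tilde{\rho}_{-}(b,\omega,\tau) + e^{-j\omega(\tau+1)} x[\tau+1]. \tag{6.70}
\end{align}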


When the modulation depends on the atomic time origin, the pursuit can be 
interpreted in terms of a modulated filter bank; for dereferenced modulation, 
however, the equivalent filter bank has a heterodyne structure. This distinction 
was discussed at length with respect to the STFT in Section 2.2.1. As will be 
seen in Section 6.4.2, dereferencing the modulation simplifies the relationship 
between the signal correlations with composite atoms and the correlations with 
underlying damped sinusoids; for this reason, future considerations will focus
primarily on the case of dereferenced modulation.
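
The heterodyne form can be sketched as follows: the signal is first demodulated by $e^{-j\omega\tau}$ and then passed through a one-pole filter with a real coefficient, as in Equations (6.69) and (6.70). Truncation is neglected, and the function names are assumptions for illustration.

```python
import cmath

def rho_tilde_plus(x, a, w):
    """Eq. (6.69) without truncation, run backward in time from a zero state."""
    N = len(x)
    out = [0j] * (N + 1)
    for tau in range(N, 0, -1):
        out[tau - 1] = a * out[tau] + cmath.exp(-1j * w * (tau - 1)) * x[tau - 1]
    return out[:N]

def rho_tilde_minus(x, b, w):
    """Eq. (6.70) without truncation, run forward in time from a zero state."""
    out, state = [], 0j
    for tau in range(len(x)):
        state = b * state + cmath.exp(-1j * w * tau) * x[tau]
        out.append(state)
    return out

def rho_plus_referenced(x, a, w, tau):
    """Untruncated Eq. (6.58), with the modulation referenced to tau."""
    return sum(x[n] * a ** (n - tau) * cmath.exp(-1j * w * (n - tau))
               for n in range(tau, len(x)))
```

The outputs of `rho_tilde_plus` differ from the referenced correlations only by the phase factor of Equation (6.68).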


Real decompositions of real signals. If dictionaries of complex atoms are 
used in matching pursuit, the correlations and hence the expansion coefficients 
for signal decompositions will generally be complex; a given coefficient thus 
provides both a magnitude and a phase for the atom in the expansion. For 
real signals, decomposition in terms of complex atoms can be misleading. For 
instance, for a signal that consists of one real damped sinusoid, the pursuit does 
not simply find the constituent conjugate pair of atoms as might be expected; 
this occurs because an atom and its conjugate are not orthogonal. For real 
signals, then, it is preferable to consider expansions in terms of real atoms: 


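\[ g^{+}_{\{a,\omega,\tau,\phi\}}[n] = S_{\{a,\omega,\phi\}}\, a^{(n-\tau)} \cos[\omega(n-\tau) + \phi]\, u[n-\tau], \tag{6.71} \]
or, in the case of dereferenced modulation,
\[ \tilde{g}^{+}_{\{a,\omega,\tau,\phi\}}[n] = S_{\{a,\omega,\tau,\phi\}}\, a^{(n-\tau)} \cos[\omega n + \phi]\, u[n-\tau]. \tag{6.72} \]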


The two cases differ by a phase offset which affects the unit-norm scaling as 
well as the modulation. 

In the case of a complex dictionary, the atoms are indexed by the three 
parameters $\{a,\omega,\tau\}$ and the phase of an atom in the expansion is given by its
correlation. In contrast, a real dictionary requires the phase parameter as an 
additional index because of the explicit presence of the phase in the argument of 
the cosine in the atom definition. The phase is not supplied by the correlation 
computation as in the complex case; like the other parameters, it must be 
discretized and incorporated as a dictionary parameter in the pursuit, which 
results in a larger dictionary and thus a more complicated search. Furthermore, 
the correlation computations are more difficult than in the complex case because 
the recursion formulae derived earlier do not apply for these real atoms. These 
problems can be circumvented by using a complex dictionary and considering
conjugate subspaces according to the formulation of Section 6.2.3. 

Conjugate subspace pursuit can be used to search for conjugate pairs of 
complex damped sinusoids; the derivation leading to Equation (6.21) verifies 
that this approach will arrive at a decomposition in terms of real damped si- 
nusoids if the original signal is real. The advantage of this method is indicated 
by Equations (6.19) and (6.20), which show that the expansion coefficients and 
the maximization metric in the conjugate pursuit are both functions of the 
correlation of the residual with the underlying complex atoms. The computa- 
tional simplifications for a dictionary of complex damped sinusoids can thus be 
readily applied to calculation of a real expansion of the form 


\[ x[n] \approx 2 \sum_{i=1}^{I} S_{a_i} A_i\, a_i^{(n-\tau_i)} \cos[\omega_i n + \phi_i], \tag{6.73} \]

where $A_i e^{j\phi_i} = \alpha_i(1)$ and the modulation is dereferenced. As in the complex
case, the phases of the atoms in this real decomposition are provided directly
by the computation of the expansion coefficients; the phase is not required as
a dictionary index, i.e., an explicit search over a phase index is not required in
the pursuit. By considering signal expansions in terms of conjugate pairs, the 
advantages of the complex dictionary are fully maintained; furthermore, note 
that the dictionary for the conjugate search is effectively half the size of the 
full complex dictionary since atoms are considered in conjugate pairs. 

It is important to note that Equation (6.73) neglects the inclusion of unmod- 
ulated exponentials in the signal expansion. Such atoms are indeed present in 
the complex dictionary, and all of the recursion speedups apply trivially; fur- 
thermore, the correlation of an unmodulated atom with a real signal is always 
real, so there are no phase issues to be concerned with. An important caveat, 
however, is that the conjugate pursuit algorithm breaks down if the atom is 
purely real; the pursuit requires that the atom and its conjugate be linearly 
independent, meaning that the atom must have nonzero real and imaginary 
parts. Thus, a fix is required if real unmodulated exponentials are to be admit- 
ted into the signal model. The i-th stage of the fixed algorithm is as follows: 
first, the correlations $\langle g, r_i \rangle$ for the entire dictionary of complex atoms are computed
using the simplifications described. Then, energy minimization metrics
for both types of atoms are computed and stored: for real atoms, the metric is
$|\langle g, r_i \rangle|^2$ as indicated in Equation (6.12); for conjugate subspaces, the metric is
$\langle g, r_i \rangle^{*} \alpha(1) + \langle g, r_i \rangle\, \alpha(1)^{*}$ as given in Equation (6.20), where $\alpha(1)$ is as defined
in Equation (6.19) and

\[ \langle g_{\{a,\omega,\tau\}}, g^{*}_{\{a,\omega,\tau\}} \rangle = S_a^2 \left( \frac{1 - a^{2L} e^{-j2\omega L}}{1 - a^{2} e^{-j2\omega}} \right). \tag{6.74} \]

These metrics quantify the amount of energy removed from the residual in
the two cases; maximization over these metrics indicates which real component
should be added to the signal expansion at the i-th stage to minimize the energy
of the new residual $r_{i+1}[n]$.
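
The closed form of Equation (6.74) can be checked against a direct sum. In this sketch the inner product is taken with the conjugate on its first argument, which is an assumption about the convention; the function names are likewise illustrative.

```python
import cmath
import math

def conj_inner_closed(a, w, L):
    """Eq. (6.74): <g, g*> for a unit-norm truncated causal atom."""
    S2 = (1.0 - a**2) / (1.0 - a**(2 * L))
    return (S2 * (1 - a**(2 * L) * cmath.exp(-2j * w * L))
            / (1 - a**2 * cmath.exp(-2j * w)))

def conj_inner_direct(a, w, L):
    """Same quantity by summation, with <u, v> = sum conj(u[n]) v[n] (assumed)."""
    S = math.sqrt((1.0 - a**2) / (1.0 - a**(2 * L)))
    g = [S * a**n * cmath.exp(1j * w * n) for n in range(L)]
    g_conj = [v.conjugate() for v in g]
    return sum(u.conjugate() * v for u, v in zip(g, g_conj))
```

For $\omega = 0$ the value is exactly one, so the atom and its conjugate coincide; this is the degenerate case in which the conjugate pursuit breaks down and the real-atom metric must be used instead.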

The description of a signal in terms of conjugate pairs does not require more
data than a model using complex atoms. Either case simply requires the indices
$\{a,\omega,\tau\}$ and the complex number $\alpha(1)$ for each atom in the decomposition.
There is of course additional computation in both the analysis and the synthesis 
in the case of conjugate pairs, but this computation improves the model’s ability 
to represent real signals. In a sense, the improvement arises because the added 
computation enables the model data to encompass twice as many atoms in the 
conjugate pair case as in the complex case. 


6.4.2 Pursuit of Composite Atoms 


Using matching pursuit to derive a signal model based on composite atoms re- 
quires computation of the correlations of the signal with these atoms. Recalling 
the form of the composite atoms given in Equations (6.50) and (6.52), these 
correlations have, by construction, a simple relationship to the correlations with 
the underlying one-sided atoms: 


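\begin{align}
\rho(a,b,J,\omega,\tau)
 &= S_{\{a,b,J\}} \sum_{\Delta=0}^{J-1} \left[ \frac{\tilde{\eta}_{+}(a,\omega,\tau+\Delta)}{S_a} + \frac{\tilde{\eta}_{-}(b,\omega,\tau+\Delta)}{S_b} - x[\tau+\Delta] \right] \tag{6.75} \\
 &= S_{\{a,b,J\}} \sum_{\Delta=0}^{J-1} \left[ \tilde{\rho}_{+}(a,\omega,\tau+\Delta) + \tilde{\rho}_{-}(b,\omega,\tau+\Delta) - x[\tau+\Delta] \right]. \tag{6.76}
\end{align}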


The correlation with any composite atom can thus be computed based on the 
correlations derived by the recursive filter banks discussed earlier; this compu- 
tation is most straightforward if dereferenced modulation is used in the con- 
stituent atoms and if these underlying atoms are unnormalized. Essentially, any 
atom constructed according to Equation (6.52), which includes simple damped 
sinusoids, can be added to the modeling dictionary at the cost of one multiply 
per atom to account for scaling; computation issues are discussed further in 
the next section. Note that for composite atoms, real decompositions of real 
signals take the form 


\[ x[n] \approx 2 \sum_{i=1}^{I} A_i\, f_{\{a_i,b_i,J_i\}}[n - \tau_i] \cos(\omega_i n + \phi_i), \tag{6.77} \]

where $f_{\{a_i,b_i,J_i\}}$ is as defined in Equation (6.46) and the amplitude $A_i$ and phase
$\phi_i$ are given by $A_i e^{j\phi_i} = \alpha_i(1)$, where $\alpha_i(1)$ is a complex expansion coefficient
derived according to Equation (6.19).
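
The relationship between composite and one-sided correlations can be sketched with untruncated, dereferenced filters. One detail here is our own reading rather than something stated in the text: for the dereferenced atom of Equation (6.45), the overlap sample at lag $\Delta$ is taken to carry the factor $e^{-j\omega(\tau+\Delta)}$, so the sketch subtracts $e^{-j\omega(\tau+\Delta)} x[\tau+\Delta]$. The function names and the `scale` argument, which stands in for $S_{\{a,b,J\}}$, are assumptions.

```python
import cmath

def one_sided(x, r, w, causal):
    """Untruncated dereferenced correlations for every tau, Eq. (6.69) or (6.70)."""
    N = len(x)
    out, state = [0j] * N, 0j
    for tau in (range(N - 1, -1, -1) if causal else range(N)):
        state = r * state + cmath.exp(-1j * w * tau) * x[tau]
        out[tau] = state
    return out

def composite_correlation(x, a, b, J, w, tau, scale=1.0):
    """Composite-atom correlation from filter-bank outputs.

    Needs 0 <= tau and tau + J <= len(x).
    """
    plus = one_sided(x, a, w, causal=True)
    minus = one_sided(x, b, w, causal=False)
    total = 0j
    for d in range(J):
        t = tau + d
        # Both one-sided filters include the sample at t, so remove it once.
        total += plus[t] + minus[t] - cmath.exp(-1j * w * t) * x[t]
    return scale * total

def composite_correlation_direct(x, a, b, J, w, tau, scale=1.0):
    """Brute-force reference: sum over n of x[n] f[n - tau] exp(-j w n)."""
    def c(m):
        return a**m if m >= 0 else b**(-m)
    total = 0j
    for n in range(len(x)):
        env = sum(c(n - tau - d) for d in range(J))
        total += x[n] * env * cmath.exp(-1j * w * n)
    return scale * total
```

In a pursuit the two filter outputs would be computed once per $(a, \omega)$ and $(b, \omega)$ pair and reused for every composite atom built from them; they are recomputed inside `composite_correlation` only to keep the sketch short.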


6.4.3 Computation Considerations 


This section compares the computational cost of two matching pursuit im- 
plementations: pursuit based on correlation updates [139] and pursuit based 
on recursive filter banks. The cost is measured in terms of memory require-
ments and multiplicative operations. Simple search operations, table lookups, 
and conditionals are neglected in the cost measure. Precomputation is allowed 
without a penalty, but storage of precomputed data is included in the mem- 
ory cost. Startup cost for the first pursuit iteration is considered separately; 
in cases where only a few atoms are to be derived, the startup arithmetic in 
the update algorithm may constitute an appreciable percentage of the overall 
computation. 


Notation. The following treatment involves modeling a real signal of length 
N using a composite dictionary based on damped sinusoids. The dictionary 
parameters consist of A different causal damping factors, B anticausal damping 
factors, H smoothing orders, K modulations, and N time shifts. The dictionary 
thus has $M = ABHKN$ atoms; using $S$ to denote the number of scales, namely
$S = ABH$, the dictionary size is given by $M = SKN$. The average scale or
atom length will be denoted by $L$; the correlation $\langle g, x \rangle$ thus requires $L$ real-complex
multiplies on average. The following comparison focuses on pursuit
of complex atoms since the evaluation of a real model based on a complex 
pursuit has equal cost in both matching pursuit implementations; also, deriving 
the correlation magnitudes requires the same amount of computation in both 
approaches. The relevant computation to compare is that required to calculate 
$\langle g, r_i \rangle$ for all of the complex atoms $g \in D$ at some stage $i$ of the algorithm.


Precomputation in the update algorithm. The update approach computes
the correlations needed for the pursuit using Equation (6.15), which relates
the correlations at stage $i+1$ to those computed at stage $i$. This method
relies on precomputation and storage of the dictionary cross-correlations $\langle g, g_i \rangle$
to reduce the cost of the pursuit. If this storage is done without taking the
sparsity or redundancy of the data into account, $M^2$ cross-correlations must
be stored. A simple example shows that such a brute force approach is prohibitive.
Consider analysis of a 10 ms frame of high-quality audio consisting
of $N = 400$ samples. In a rather small dictionary with $K = 32$, $A = 10$,
$B = 1$, and $H = 1$, there are roughly $M = 10^5$ atoms. Storage of the complex-valued
cross-correlations then requires $2M^2 = 2 \times 10^{10}$ memory locations. This
is altogether unreasonable, so it is necessary to investigate the possibility of 
memory-computation tradeoffs. 
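
The arithmetic of this example can be reproduced directly. The function name `dictionary_size` is an assumption; the parameter values are those quoted in the text.

```python
def dictionary_size(A, B, H, K, N):
    """M = A * B * H * K * N atoms in the composite dictionary."""
    return A * B * H * K * N

M = dictionary_size(A=10, B=1, H=1, K=32, N=400)
# Two memory locations (real and imaginary part) per stored cross-correlation.
brute_force_locations = 2 * M**2
```

The exact count is $M = 128{,}000$, which the text rounds to $10^5$; with the exact $M$ the brute-force storage is about $3 \times 10^{10}$ locations, the same order as the quoted $2 \times 10^{10}$.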

The memory requirement can be reduced by considering the sparsity and 
redundancy of the cross-correlation data. First, many of the atom pairs have 
no time overlap and thus zero correlation; these cases can be handled with 
conditionals. For atoms that do overlap, the correlation storage can be reduced 
using the following formulation. Introducing the notation 


g(80, Wo; To) = 9{00,b0,Jo,wo,70} (72) = ftsoy[n — To]e**°” (6.78) 
g(81,W1,71) 9{ 01 ,b1 1 wr 71} 17 — fts3[n—- nije", (6.79) 
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where 8p and s; serve as shorthand for the effective scales of the atoms and f[n] 
is a unit-norm envelope constructed as in Equation (6.46), the cross-correlation 
of two composite atoms is given by 


(9(80, Wo, 70) 9(81,41,71)) = > frsoyln — Tol f{s,}[n — Je" (6.80) 


{let m=n—-to} = Y- freoylmfpe,ylm — (71 — To)JeF2 HF) (6.81) 
™ 
= elon ~wo)To (9(80, 0, 0), g(81,W1 — W0,7T1 — T))- (6.82) 


Thus, with the exception of a phase shift, the cross-correlation depends only
on the relative time locations and modulation frequencies of the atoms.
Furthermore, it only depends on the absolute frequency difference since negative
values of $\omega_1 - \omega_0$ can be accounted for by conjugation:


$$\langle g(s_0, 0, 0), g(s_1, \omega_1 - \omega_0, \tau_1 - \tau_0) \rangle = \langle g(s_0, 0, 0), g(s_1, \omega_0 - \omega_1, \tau_1 - \tau_0) \rangle^{*}. \tag{6.83}$$


Conjugation can also be used to handle redundancy in the cross-correlations 
for scale pairs: 


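$$\langle g(s_1, \omega_1, \tau_1), g(s_0, \omega_0, \tau_0) \rangle = \langle g(s_0, \omega_0, \tau_0), g(s_1, \omega_1, \tau_1) \rangle^{*} \tag{6.84}$$
$$= e^{j(\omega_0 - \omega_1)\tau_0}\, \langle g(s_0, 0, 0), g(s_1, \omega_1 - \omega_0, \tau_1 - \tau_0) \rangle^{*}. \tag{6.85}$$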


This scale property serves to reduce the memory requirements by roughly a 
factor of two. 
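
The shift property of Equations (6.80)-(6.82) and the two conjugation properties can be checked numerically. The sketch below is a minimal illustration in plain Python; the two-sided exponential envelope, the parameter values, and all function names are assumptions made for the example and do not reproduce the envelope of Equation (6.46).

```python
import cmath
import math

def envelope(a, b, half_length=64):
    """Hypothetical unit-norm envelope: a**n for n >= 0, b**(-n) for n < 0."""
    f = {n: (a ** n if n >= 0 else b ** (-n))
         for n in range(-half_length, half_length + 1)}
    norm = math.sqrt(sum(v * v for v in f.values()))
    return {n: v / norm for n, v in f.items()}

def atom(f, omega, tau):
    """g[n] = f[n - tau] * exp(j*omega*n), stored as {n: value}."""
    return {m + tau: v * cmath.exp(1j * omega * (m + tau)) for m, v in f.items()}

def inner(x, y):
    """<x, y> = sum_n conj(x[n]) * y[n] over the common support."""
    return sum(x[n].conjugate() * y[n] for n in x if n in y)

f0, f1 = envelope(0.9, 0.5), envelope(0.8, 0.7)
w0, w1, t0, t1 = 0.3, 1.1, 5, 9

# Equation (6.82): full correlation vs. phase-shifted reference correlation.
lhs = inner(atom(f0, w0, t0), atom(f1, w1, t1))
base = inner(atom(f0, 0.0, 0), atom(f1, w1 - w0, t1 - t0))
rhs = cmath.exp(1j * (w1 - w0) * t0) * base

# Equation (6.83): negating the frequency difference conjugates the result.
flipped = inner(atom(f0, 0.0, 0), atom(f1, w0 - w1, t1 - t0))

# Equation (6.84): swapping the two atoms conjugates the result.
swapped = inner(atom(f1, w1, t1), atom(f0, w0, t0))

# All three differences should be at the level of floating-point rounding.
print(abs(lhs - rhs), abs(base - flipped.conjugate()), abs(swapped - lhs.conjugate()))
```

Only the reference value `base` would need to be stored in a lookup table; the other correlations follow from it by a phase shift or a conjugation.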

The formulations given above drastically reduce the amount of memory required
to store the dictionary cross-correlations. For the modulation frequencies,
there are K distinct possibilities for $|\omega_1 - \omega_0|$. For the time shifts, the S
different scales can be considered in pairs using L to approximate the number
of lags that lead to overlap and nonzero correlation; there are roughly $S^2 L$
different configurations. In total, then, $2S^2KL$ memory locations are required
to store the distinct cross-correlation values; the scale-pair redundancy reduces
this count to $S^2KL$. For the simple example discussed above, this amounts to
about $6 \times 10^4$ locations for L = 20. Noting the phase shift in Equation (6.82),
this reduction in the memory requirements is achieved at the cost of a complex
multiply, or three real multiplies, for each correlation update.*
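
The three-multiply form of the complex product mentioned here can be written out as a short Python sketch; the function name is hypothetical.

```python
def complex_multiply_3(a, b, c, d):
    """(a + bj)(c + dj) using three real multiplies instead of four."""
    t1 = c * (a + b)
    t2 = b * (c + d)
    t3 = d * (a - b)
    return t1 - t2, t2 + t3  # (real part, imaginary part)

re, im = complex_multiply_3(1.0, 2.0, 3.0, 4.0)
print(re, im)  # -5.0 10.0, matching (1 + 2j) * (3 + 4j)
```

The saving of one multiply is paid for with three extra additions, which is usually a good trade when multiplies dominate the cost.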


*The complex multiply $(a + bj)(c + dj) = ac - bd + j(ad + bc)$ can be carried out using three
multiplies by computing $c(a+b)$, $b(c+d)$, and $d(a-b)$. Then, $ac - bd$ is given by the difference
of the first two terms; $ad + bc$ is the sum of the second and the third terms.


Precomputation in the filter bank algorithm. In the filter bank approach,
the pursuit computation is based on correlations with unnormalized
atoms as formalized in Equation (6.76), which holds for any dictionary of
composite atoms or simple damped sinusoids (where H = 1 and B = 1). This
correlation computation requires scaling by $S_{\{a,b,J\}}$, so these scaling factors
are precomputed and stored. When the causal and anticausal damping factors
do not have any particular symmetry, storing the scaling factors requires
S = ABH memory locations.


The first iteration of the update algorithm. In the first stage of the 
update algorithm, all of the signal correlations with the dictionary atoms must 
be computed, which requires ML = SKNL real-complex multiplies, or 2ML
real multiplies; storing the results requires 2M locations, so the total memory
needed in the update algorithm is $S^2KL + 2M$. Note that the computation
could be carried out with recursive filter banks at a lower cost, but such a 
merged approach will not be treated here. 


Later iterations of the update algorithm. Once the dictionary cross- 
correlations have been precomputed and the correlations for the first stage of 
the pursuit have been calculated and stored, the cost of the update algorithm 
depends only on the update formula. Each stage of the algorithm involves M 
complex-complex multiplies (3M real) to multiply the M cross-correlations by 
$\alpha_i$, plus another M complex-complex multiplies to carry out the phase shift
given in Equation (6.82), for a total of 6M real multiplies per iteration. Note 
that in the update algorithm it is not necessary to keep the signal in memory 
after the first iteration or to ever actually compute the residual signal. 
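
The bookkeeping of the update algorithm can be illustrated with a small sketch. It assumes the usual matching pursuit update for unit-norm atoms, in which the correlation of each atom g with the new residual equals its previous correlation minus the chosen coefficient times the cross-correlation of g with the chosen atom; this is taken here to be the content of Equation (6.15). The random dictionary, the brute-force cross-correlation table (which sidesteps the phase-shift multiply of Equation (6.82)), and all names are illustrative assumptions.

```python
import random

def inner(x, y):
    """<x, y> = sum_n conj(x[n]) * y[n]."""
    return sum(a.conjugate() * b for a, b in zip(x, y))

def unit_norm(v):
    norm = abs(inner(v, v)) ** 0.5
    return [a / norm for a in v]

def pursuit_by_updates(x, atoms, n_iter):
    """Matching pursuit that never forms the residual after the first stage."""
    M = len(atoms)
    # Precomputation: dictionary cross-correlations <g_m, g_k> (brute force here).
    cross = [[inner(atoms[m], atoms[k]) for k in range(M)] for m in range(M)]
    # First iteration: correlations of the signal with every atom.
    corr = [inner(g, x) for g in atoms]
    chosen = []
    for _ in range(n_iter):
        k = max(range(M), key=lambda m: abs(corr[m]))
        alpha = corr[k]
        chosen.append((k, alpha))
        # Later iterations: one complex multiply per atom, no signal access.
        corr = [corr[m] - alpha * cross[m][k] for m in range(M)]
    return chosen, corr

def random_vector(n):
    return [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

random.seed(0)
N, M = 16, 40
atoms = [unit_norm(random_vector(N)) for _ in range(M)]
x = random_vector(N)
chosen, corr = pursuit_by_updates(x, atoms, n_iter=5)

# Cross-check against correlations with an explicitly computed residual.
residual = list(x)
for k, alpha in chosen:
    residual = [r - alpha * g for r, g in zip(residual, atoms[k])]
direct = [inner(g, residual) for g in atoms]
max_error = max(abs(u - v) for u, v in zip(corr, direct))
print(max_error)  # expected to be at rounding level
```

The residual is formed here only to verify the updated correlations; the pursuit loop itself uses nothing but the stored correlations and the cross-correlation table.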


Iterations in the filter bank approach. In matching pursuit based on 
recursive filter banks, the scaling factors $S_{\{a,b,J\}}$ are precomputed and available
via lookup. In addition to the scaling factors, the residual signal must be 
stored, which requires N memory locations. The final memory requirement 
is the storage of the correlations with the constituent unnormalized damped 
sinusoids, which are needed to compute the correlations with the composite 
atoms. For some smoothing order J, correlations with J causal and J anticausal 
damped sinusoids are required. Storing these underlying correlations in a local 
manner requires 2(A + B)KJ locations, where the factor of two arises because
the correlations are complex; global storage of these correlations requires
2(A + B)KN locations, so the worst case memory requirement in the filter bank
case is S + N + 2(A + B)KN.


The filter bank algorithm uses (A + B)K recursive filters to derive the
correlations. In the dereferenced case of Equations (6.69) and (6.70), each recursion
requires four real-real multiplies for each of the N time points if atom
truncation is neglected, or six if truncation is included. As given by Equation (6.76),
correlations with composite atoms are computed by adding the correlations
with constituent unnormalized damped sinusoids and then scaling with the
appropriate factor; this process introduces S = ABH real-complex multiplies, or
2S real multiplies. Thus, 6(A + B)KN + 2ABH real multiplies are needed to
compute the pursuit correlations. Once an atom is chosen based on these
correlations, the residual must be updated; this requires roughly 5L multiplies to
generate the unit-norm atomic envelope, modulate it to the proper frequency,
and weight it with its expansion coefficient prior to subtraction from the signal.
The total computational cost per iteration for the filter bank algorithm is thus
5L + 6(A + B)KN + 2S.


                 MEMORY (real numbers)               COMPUTATION (real multiplies)
Method           Precomp.            Algorithm       First iteration   Later iterations
Update           S^2 KL =            2M =            2ML =             6M =
                 A^2 B^2 H^2 KL      2ABHKN          2ABHKNL           6ABHKN
Filter bank      S = ABH             N + 2(A+B)KN    5L + 2ABH + 6(A+B)KN (each iteration)


Table 6.1. Comparison of memory and computation requirements for matching pursuit
using the correlation update approach and the recursive filter bank method. N is the length
of the signal; the dictionary index set contains A causal damping factors, B anticausal
damping factors, H smoothing orders, S = ABH scales, K modulations, and N time
shifts; the dictionary thus contains M = SKN = ABHKN distinct atoms. L is the
average time support of a dictionary atom.
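
To illustrate the kind of recursive correlation computation used by the filter bank algorithm, the following sketch computes the correlations of a signal with an unnormalized causal damped sinusoid at every time shift using a one-pole recursion run backwards in time. This is one plausible form of such a recursion, written under the assumption that the atom at shift tau is a^(n - tau) e^{j omega n} for n >= tau with no truncation; it is not a transcription of Equations (6.69) and (6.70), and the function names are hypothetical.

```python
import cmath
import random

def causal_correlations(x, a, omega):
    """rho[tau] = sum_{n >= tau} a**(n - tau) * exp(-1j*omega*n) * x[n],
    computed for every tau by the recursion
    rho[tau] = exp(-1j*omega*tau) * x[tau] + a * rho[tau + 1]."""
    N = len(x)
    rho = [0j] * (N + 1)  # rho[N] = 0 starts the backward recursion
    for tau in range(N - 1, -1, -1):
        rho[tau] = cmath.exp(-1j * omega * tau) * x[tau] + a * rho[tau + 1]
    return rho[:N]

def causal_correlations_direct(x, a, omega):
    """Same quantity by direct summation, O(N^2); used only as a check."""
    N = len(x)
    return [
        sum(a ** (n - tau) * cmath.exp(-1j * omega * n) * x[n] for n in range(tau, N))
        for tau in range(N)
    ]

random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(50)]
fast = causal_correlations(x, a=0.9, omega=0.4)
slow = causal_correlations_direct(x, a=0.9, omega=0.4)
max_error = max(abs(u - v) for u, v in zip(fast, slow))
print(max_error)  # expected to be at rounding level
```

In this sketch each time point costs one real-complex product for the modulation and one for the damping, so the cost grows linearly in N rather than with the product of N and the atom support.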


Quantitative comparison. The results of this section are summarized in 
Table 6.1. To quantify the comparison, consider modeling a signal of length 
N = 400 with a dictionary having A = 20, B = 10, H = 10, L = 30, and 
K = 32. The update method requires storage of $3.8 \times 10^9$ precomputed values
and $5.1 \times 10^7$ values for a given iteration, while the filter bank method requires
$2 \times 10^3$ precomputation locations and $7.7 \times 10^5$ locations for a given iteration;
the filter bank approach requires less memory. The update method carries out
$1.5 \times 10^9$ multiplies for the first iteration and $1.5 \times 10^8$ multiplies for each
iteration thereafter. The filter bank framework requires $2.3 \times 10^6$ multiplies for
each iteration, so it provides a considerable reduction in the cost of the pursuit 
computation. 
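
The entries of Table 6.1 are simple enough to evaluate directly. The following sketch encodes the table's expressions as two hypothetical helper functions and evaluates them for the parameters of this comparison; the dictionary keys are naming choices made for the example.

```python
def update_method_costs(A, B, H, K, N, L):
    """Memory (real numbers) and computation (real multiplies), update approach."""
    S = A * B * H
    M = S * K * N
    return {
        "precomp_memory": S * S * K * L,
        "algorithm_memory": 2 * M,
        "first_iteration": 2 * M * L,
        "later_iterations": 6 * M,
    }

def filter_bank_costs(A, B, H, K, N, L):
    """Memory (real numbers) and computation (real multiplies), filter bank approach."""
    S = A * B * H
    per_iteration = 5 * L + 2 * S + 6 * (A + B) * K * N
    return {
        "precomp_memory": S,
        "algorithm_memory": N + 2 * (A + B) * K * N,
        "first_iteration": per_iteration,
        "later_iterations": per_iteration,
    }

params = dict(A=20, B=10, H=10, K=32, N=400, L=30)
update = update_method_costs(**params)
bank = filter_bank_costs(**params)
for name in update:
    print(f"{name:18s} update={update[name]:.1e}  filter bank={bank[name]:.1e}")
```

Changing `params` makes it easy to see how quickly the update method's precomputed table grows with the number of scales, since it depends on the square of S.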


6.5 CONCLUSION 


Atomic models provide descriptions of signals in terms of events that are lo- 
calized in time-frequency. Derivation of optimal models based on overcomplete 
sets of atoms is computationally prohibitive, but effective models can be ar- 
rived at by greedy algorithms such as matching pursuit and its variations. In 
this chapter, matching pursuit was developed as an approach for deriving compact
signal-adaptive parametric models based on dictionaries of time-frequency
atoms; such pursuit provides an analysis method for granular audio synthe- 
sis. It is important to note that such dictionary-based models are tradition- 
ally classified as nonparametric, but that this categorization is inappropriate 
here since the dictionary atoms chosen for the pursuit models can be fully 
described by physically meaningful parameters. In Section 6.3, dictionaries 
consisting of symmetric Gabor atoms, damped sinusoids, and composite atoms 
constructed from underlying damped sinusoids were considered and compared. 
It was shown that the matching pursuit computation for both damped sinu- 
soidal atoms and composite atoms can be carried out efficiently using simple 
recursive filter banks. 

With respect to the pursuit of damped sinusoidal atoms, it should be noted 
that estimation of the parameters of damped sinusoids in a signal has been 
widely considered for the tasks of system identification and spectral estimation 
[71, 114, 115, 119, 122, 203, 222, 229]. These applications, however, usually
involve an underlying source-filter model of the signal and thus differ from the 
task of generalized signal modeling. Of course, the estimation of damped sinu- 
soidal parameters by matching pursuit can be tailored for spectral estimation 
and system identification, but this requires constraining the pursuit in various 
ways. While this relationship between overcomplete expansions based on para- 
metric dictionaries and parameter estimation methods such as ESPRIT has 
yet to be extensively considered, it has been shown that a greedy algorithm 
related to damped sinusoidal pursuit is effective for spectral estimation of non- 
stationary signals [175]; greedy parameter estimation seems to be well-suited 
for scenarios where little or no a priori information about the signal is available.


7 CONCLUSIONS


And the end of all our exploring 


Will be to arrive where we started 
And know the place for the first time. 


— T.S.Eliot, “Little Gidding” 


This book explores a variety of signal models, namely the sinusoidal model, 
multiresolution sinusoidal models, residual models, pitch-synchronous wavelet 
and Fourier representations, and atomic decompositions. The key issues dealt 
with in this text are summarized in the following section; thereafter, directions 
for further research are discussed. 


7.1 SIGNAL-ADAPTIVE PARAMETRIC REPRESENTATIONS 


In modeling a signal, it is of primary importance that the model be adapted 
to the signal in question. Otherwise, the model will not necessarily provide 
a meaningful or useful representation of the signal. The models considered 
in this book are examples of such signal-adaptive models. In each case, the 
model is constructed in a signal-adaptive fashion; this leads to compact models 
which are useful for analysis, compression, denoising, and modification. Some 
of these capabilities are enhanced by the parametric nature of the models. If 
a signal is represented in terms of perceptually salient parameters, meaningful 
modifications can be made by simple adjustment of the parameters; further- 
more, perceptual principles can be readily applied to achieve data reduction. 
The following sections provide a review of the main issues discussed in each 
chapter. 


7.1.1 The STFT and Sinusoidal Modeling 


In Chapter 2, the sinusoidal model is developed as a parametric extension of 
the short-time Fourier transform. The filter bank interpretation of the STFT 
is reviewed and extended, and various perfect reconstruction criteria are developed.
In Section 2.2.2, however, it is shown by a simple example that such a
rigid filter bank does not provide a compact representation of an evolving signal.
This motivates representing the subband signals in terms of a parametric
model based on estimating and tracking evolving sinusoidal partials. Analysis 
methods for estimating the partial parameters are considered; the treatment 
includes a linear algebraic interpretation of spectral peak picking. Also, time- 
domain and frequency-domain synthesis techniques are discussed. 


7.1.2 Multiresolution Sinusoidal Modeling 


If operated with a fixed frame size, the sinusoidal model has difficulties rep- 
resenting nonstationary signals. Accurate reconstruction of dynamic behavior 
can be achieved by carrying out the sinusoidal model in a multiresolution frame- 
work. In Chapter 3, two multiresolution extensions based respectively on filter 
banks and adaptive time segmentation are discussed; the focus is placed pri- 
marily on the latter method, which is shown to substantially mitigate pre-echo 
distortion. A dynamic program for deriving pseudo-optimal segmentations is 
developed; furthermore, globally exhaustive and simple heuristic algorithms 
are both considered, and the various approaches are compared with respect to 
computational cost. 


7.1.3 Residual Modeling 


In parametric methods such as the sinusoidal model, the analysis-synthesis pro- 
cess generally does not lead to a perfect reconstruction of the original signal; 
there is a nonzero difference between the original and the inexact reconstruc- 
tion. For high-quality synthesis, it is important to model this residual and 
incorporate it in the signal reconstruction; this accounts for salient features 
such as breath noise in a flute sound. In Chapter 4, residual modeling for si- 
nusoidal analysis-synthesis is discussed. For multiresolution sinusoidal models, 
the residual can be perceptually well-modeled as white noise shaped by a filter 
bank with time-varying channel gains whose subbands are spaced in frequency 
according to psychoacoustic considerations. The channel gains are determined 
by analyzing the residual; these gains serve as an efficient parametric repre- 
sentation of the residual. Strictly speaking, this residual analysis-synthesis is 
not signal-adaptive; however, it is necessary to consider such methods for use 
with near-perfect reconstruction models such as those described in this text. 
When used in conjunction with the sinusoidal model, this approach leads to 
high-fidelity reconstruction of natural sounds. 


7.1.4 Pitch-Synchronous Models 


For pseudo-periodic signals, compaction can be achieved by incorporating the 
pitch in the signal model. In Chapter 5, pitch-synchronous modeling and pro- 
cessing is discussed. It is shown that both the sinusoidal model and the wavelet 
transform can be improved by pitch-synchronous operation when the original 
signal is pseudo-periodic. In either approach, periodic signal regions can be
efficiently represented while aperiodic regions, e.g. note onsets, can be modeled
using the perfect reconstruction capability of the underlying transform, namely 
the discrete wavelet transform in the pitch-synchronous wavelet case and the 
Fourier transform in the pitch-synchronous sinusoidal model. 


7.1.5 Atomic Decompositions 


In Chapter 3, the sinusoidal model is interpreted as a decomposition in terms 
of time-frequency atoms constructed according to parameters extracted from 
the signal; this interpretation motivates the various multiresolution extensions 
of the model. In Chapter 5, pitch-synchronous transforms are similarly inter- 
preted as granulation methods; in those approaches, a pseudo-periodic signal is 
decomposed into pitch period grains according to estimates of the signal period- 
icity, and these grains are further modeled using Fourier or wavelet techniques. 
The atomic models discussed in Chapter 6 differ from these representations in 
that the atoms for the model are not derived from signal parameters; rather, 
parametric atoms that match the signal behavior are chosen from an overcom- 
plete dictionary. Such dictionary-based models are traditionally classified as 
nonparametric, but the parametric nature of the atoms suggests that this clas- 
sification is inappropriate. Indeed, such atomic decompositions are intrinsically 
similar to standard parametric representations such as the sinusoidal model. 
Atomic models based on overcomplete dictionaries of time-frequency atoms 
can be computed using the matching pursuit algorithm. Typically, the dictio- 
naries consist of Gabor atoms based on a symmetric prototype window; such 
atoms have difficulties representing transient behavior, however. With the goal 
of overcoming this problem, alternative dictionaries are considered, namely 
dictionaries of damped sinusoids as well as dictionaries of general asymmetric 
atoms constructed based on underlying causal and anticausal damped sinu- 
soids. It is shown in Section 6.4 that the matching pursuit computation for 
either type of atom can be carried out with low-cost recursive filter banks. 


7.2 RESEARCH DIRECTIONS 


The work discussed in this book has a number of natural extensions. This 
section describes extensions in audio coding and provides suggestions for further 
work involving overcomplete expansions. 


7.2.1 Audio Coding 


The current standard methods in audio coding use cosine-modulated filter 
banks; perceptual criteria and prediction models are applied to the subband
signals to achieve data reduction [26, 166, 178, 223]. Some signal adaptivity 
is achieved by adjusting the filter lengths according to the signal behavior; in 
terms of the prototype window for the filter bank, a short window is used in 
the vicinity of transients and a long window is used for stationary regions. It 
is an open question whether the rate-distortion performance of these window
switching filter bank approaches can be rivaled by parametric methods such as
the sinusoidal model or overcomplete atomic models. 


Sinusoidal modeling. In the sinusoidal model, which has received recent at- 
tention for the application of audio coding, quantization is a primary open issue 
[130, 225]. For instance, it is of interest to incorporate perceptual resolution 
limits in the amplitude and frequency quantization schemes. Another impor- 
tant psychoacoustic consideration is the formal characterization of distortion 
artifacts such as pre-echo; such characterizations are required if the method is 
to be compared to standard techniques. 


For audio coding using the sinusoidal model, a number of data reduction 
techniques and modeling improvements are of possible interest. First, predictive 
models of the partial tracks in time and frequency may be useful for data 
reduction; linear prediction of the spectral envelope has been applied to speech 
coding based on the sinusoidal model [3]. Such prediction may also prove 
useful for assisting with the estimation of sinusoidal parameters in upcoming 
signal frames. In this light, the sinusoidal model holds some promise for the 
application of audio transmission on packet-lossy networks; signal segments 
corresponding to lost packets can be reconstructed by using models of track 
evolution to interpolate the parameters from adjacent received packets. 


Since compaction leads to coding gain, audio coding using the sinusoidal 
model would benefit from the ability to derive the most compact sinusoidal rep- 
resentation of a signal. Given that the expansion functions in an oversampled 
DFT correspond to a tight frame, methods of obtaining optimal or pseudo- 
optimal sparse frame expansions provide a means for obtaining such optimally 
compact sinusoidal models. In this sense, some improvements in sinusoidal 
modeling techniques may arise from the theory of overcomplete expansions. 
Note that a procedure similar to matching pursuit is used in [77] to estimate 
sinusoidal components; as discussed in Chapter 6, however, such pursuit does 
not yield an optimal compact representation, so some improvement can be 
achieved. In addition to improvements in compaction due to frame-theoretic 
approaches, further investigations along such lines may indicate successive re- 
finement frameworks for the sinusoidal model that offer advantages over current 
techniques. 


Multiresolution sinusoidal models are of significant interest for audio coding 
for a number of reasons beyond their improved signal representation capabili- 
ties. For one, dynamic segmentation allows for optimization of the model in a 
rate-distortion sense, which is of course useful for coding applications; furthermore,
psychoacoustic criteria such as perceptual entropy can be incorporated
in the dynamic program to determine the optimal segmentation. Multirate fil- 
ter bank methods coupled with sinusoidal modeling are also of interest for audio 
coding since they allow for modeling and synthesis at subsampled rates; such 
efficient synthesis is of great importance given the applications of audio recording
and broadcasting, both of which demand real-time signal reconstruction.


The possible advances suggested above can be viewed as steps in the development
of a fully optimal sinusoidal model. In addition to an appropriate
multiresolution scheme such as dynamic segmentation, achieving a fully optimal 
model indeed requires global consideration of the parameter estimation tech- 
nique (e.g. spectral peak picking), the line tracking method, and the parameter 
interpolation functions used for reconstruction. These various components of 
sinusoidal analysis-synthesis are intrinsically interdependent; it is an open ques- 
tion as to how these dependencies can be accounted for in model optimization. 


Finally, it should be noted that it is of interest in the multimedia community 
to carry out signal modifications in the compressed domain. Some modifications 
based on filter bank audio compression have been developed, but these are 
somewhat restricted in comparison to the rich class of modifications enabled 
by a sinusoidal signal model [130]. 


Atomic models. Whereas there are clear indications that the sinusoidal 
model may be useful as an audio coding scheme, it has not yet been shown 
that atomic models based on overcomplete dictionaries are similarly promising. 
One fundamental advance required for application of atomic modeling to audio 
coding is the ability to carry out matching pursuit effectively in a frame-by- 
frame manner so that signals of arbitrary length can be processed. Matching 
pursuit using fixed frames has been described in the literature [196], but such 
an approach is unable to identify or model atomic components that overlap 
frame boundaries. 


There are several additional noteworthy points regarding atomic models 
and audio coding. First, an atomic signal model would allow elaborate time- 
frequency masking principles to be incorporated in the coding scheme. Also, 
given an atomic model, it can be expected that some coding gain can be 
achieved based on the occurrence of redundant structures in the atomic index 
sets; entropy coding of the indices may prove useful. Finally, further capabili- 
ties for identifying basic signal behavior are of interest; for instance, pursuit of 
atoms with harmonic structure may prove useful for audio signal modeling. 


Beyond audio coding, another conceivable application of atomic modeling is 
to represent the residual of some independent analysis-synthesis process such 
as the sinusoidal model. An analogy is the compression technique described 
in [162], where matching pursuit is used to derive a model of the residual in a 
motion-compensated video coder. In that approach, many simplifications arise 
due to the structure of the residual and the characteristics of visual percep- 
tion; these enable real-time analysis. It is an open question whether similar 
improvements can be developed for audio residuals. 


Finally, it should be noted that matching pursuit has received some attention 
in the image coding literature [18, 204]. With regard to this application, it
may be of interest to use asymmetric atoms to improve modeling of edges in 
images, which is of course analogous to modeling onsets in audio signals.


7.2.2 Overcomplete Atomic Expansions


In addition to further work in audio coding, the developments in this book 
suggest extensions involving overcomplete signal expansions in terms of time- 
frequency atoms. Such issues are described in the following. 


Evolutionary models. The sinusoidal model can be interpreted as an atomic 
decomposition wherein the atoms are related in an evolutionary fashion. This 
evolution model leads to synthesis robustness, modification capabilities, and 
data reduction. It would be useful to establish a similar evolution framework 
for atomic models based on overcomplete dictionaries. 


Dictionary design and optimization. In matching pursuit and similar 
methods, the performance of the algorithm depends on the contents of the 
dictionary; such algorithms perform well if the dictionary contains atoms that 
match the signal behavior. Of course, this condition is more likely to hold for 
larger dictionaries, but increased dictionary size entails increased computation 
in the algorithm. One approach to handling this tradeoff is to generate a signal- 
adaptive dictionary which can be expected to perform well for a specific signal; 
this is only of interest, however, if such a dictionary can be arrived at by a 
simple heuristic analysis rather than a high-cost optimization. 


Dictionary design issues in matching pursuit relate to codebook design is- 
sues for vector quantization. The primary difference is that vector quantization 
codebooks do not typically have the parametric structure of time-frequency dic- 
tionaries. Methods for codebook optimization are still of interest for matching 
pursuit, however, since the codebook adaptation can be restricted to adhere 
to a parametric atomic structure. This connection is briefly explored in [42]; 
given the extent of work that has been devoted to vector quantization tech- 
niques, further investigations of applications to time-frequency atomic models 
are clearly merited [80]. 


Variations of matching pursuit. Several variations of matching pursuit are 
described in Chapter 6; it is argued that comparison of such approaches calls 
for computation-rate-distortion considerations. Preliminary formalizations of 
such tradeoffs have appeared in the literature, but there are many open ques- 
tions [93]. With computation concerns in mind, it is of interest to consider 
simplifications of matching pursuit. For instance, in [139], pursuit based on 
small subdictionaries is discussed; if the subdictionaries are well-chosen, this 
helps to reduce the computational requirements without substantially affecting 
the convergence of the atomic model. One possible way to generate useful sub- 
dictionaries is to employ a pyramid multiresolution scheme in which large scale 
atoms are evaluated with respect to subsampled versions of the signal; in this 
prospective scenario, the computation is reduced since some of the correlations 
are carried out in a subsampled domain.


Refinement and modification. In Chapter 1, the application of reassignment
methods to time-frequency distributions is briefly discussed. Such techniques
start with a standard distribution and apply various refinements in order
to achieve compaction in the time-frequency plane; this improves the readabil- 
ity of the distribution since the nonlinear refinements lead to enhancement of 
the peaks in the representation and attenuation of the cross terms [11]. In cases 
where the resources are available to derive a dispersed but exact overcomplete 
expansion using the SVD pseudo-inverse (or some other method), some form 
of adaptive refinement may prove useful for improving the compaction without 
sacrificing the accuracy of the expansion. One example of such an approach 
is as follows. Since the dictionary is overcomplete, some components in the 
expansion can be represented in terms of the other components. Then, such 
representation vectors can be added to the expansion while zeroing the corre- 
sponding components; in this way, the same signal reconstruction can be arrived 
at from a more compact model. The caveat here is that optimal compaction is 
still not feasible given the general complexity results presented in [42]; however, 
improved models may be achieved in some cases using such a method. 

In addition to refinement of overcomplete expansions to improve compaction, 
other modifications are also of interest. In such efforts, the null space of the dic- 
tionary matrix provides a significant caveat; in short, some modifications may 
indeed map to this null space, meaning that a seemingly elaborate modification 
of the atomic components may indeed have no effect on the signal reconstruc- 
tion. The open question in this area is that of establishing constraints on 
modifications to ensure robustness, i.e. predictability of the modifications.
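
A toy example may make the role of the null space concrete. The sketch below uses a hypothetical three-atom dictionary in two dimensions (the atoms are deliberately not unit-norm, to keep the arithmetic exact): adding a multiple of a null-space vector to the expansion coefficients leaves the reconstruction unchanged, which can either compact the expansion or render a coefficient modification ineffective.

```python
def reconstruct(atoms, coeffs):
    """Signal = sum_k coeffs[k] * atoms[k]."""
    length = len(atoms[0])
    return [sum(c * atom[n] for c, atom in zip(coeffs, atoms)) for n in range(length)]

# Toy overcomplete dictionary in R^2 (an assumption for illustration).
atoms = [[1, 0], [0, 1], [1, 1]]
# atoms[0] + atoms[1] - atoms[2] = 0, so this vector lies in the null space.
null_vector = [1, 1, -1]

coeffs = [2, 2, 1]                                           # dispersed expansion
compact = [c - 2 * v for c, v in zip(coeffs, null_vector)]   # three nonzeros -> one

print(reconstruct(atoms, coeffs))       # [3, 3]
print(compact)                          # [0, 0, 3]
print(reconstruct(atoms, compact))      # [3, 3]
print(reconstruct(atoms, null_vector))  # [0, 0]
```

The last line shows the caveat in the text: a modification of the coefficients along `null_vector` has no effect whatsoever on the reconstructed signal.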


Appendix A 
Two-Channel Filter Banks 


The discrete wavelet transform is fundamentally connected to two-channel per- 
fect reconstruction filter banks. These connections are explored in Chapter 3. 
Here, the relevant mathematical details involving two-channel perfect recon- 
struction filter banks are given. 


Two-channel critically sampled perfect reconstruction filter banks. 
The discrete wavelet transform can be derived in terms of critically sampled 
two-channel perfect reconstruction filter banks such as the one shown in Figure 
3.3. The analysis of the system is carried out here in the frequency domain; 
the time-domain interpretation will be discussed in the next section. In terms 
of the z-transforms of the signals and filters, the output of the filter bank is: 


$$\hat{X}(z) = \tfrac{1}{2}\left[H_0(z)G_0(z) + H_1(z)G_1(z)\right] X(z) + \tfrac{1}{2}\left[H_0(-z)G_0(z) + H_1(-z)G_1(z)\right] X(-z) \tag{A.1}$$
$$= T(z)X(z) + A(z)X(-z), \tag{A.2}$$

where $T(z)$ is the direct transfer function of the filter bank and $A(z)$
characterizes the aliasing, that is, the appearance of the modulated version $X(-z)$ in the
output. The perfect reconstruction conditions are then clearly


$$T(z) = 1 \tag{A.3}$$
$$A(z) = 0, \tag{A.4}$$


or, in terms of the filters, 


$$G_0(z)H_0(z) + G_1(z)H_1(z) = 2 \tag{A.5}$$
$$G_0(z)H_0(-z) + G_1(z)H_1(-z) = 0, \tag{A.6}$$

which can be rewritten in matrix form as

$$\begin{bmatrix} G_0(z) & G_1(z) \\ G_0(-z) & G_1(-z) \end{bmatrix} \begin{bmatrix} H_0(z) & H_0(-z) \\ H_1(z) & H_1(-z) \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}. \tag{A.7}$$

This condition can be expressed in a shorthand form as

$$G_m^{T}(z) H_m(z) = 2I \tag{A.8}$$

in terms of the modulation matrices $G_m(z)$ and $H_m(z)$ and the identity matrix
I; such modulation matrices are useful in multirate filter bank theory [238]. The
design of a perfect reconstruction filter bank then amounts to the derivation
of four polynomials $G_0(z)$, $G_1(z)$, $H_0(z)$, and $H_1(z)$ that satisfy the condition
above [238].
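
As a sanity check on these conditions, the sketch below evaluates them for a Haar filter pair at a few points in the z-plane. The specific filters are an assumption chosen for illustration (the synthesis filters are noncausal so that the direct transfer function is exactly 1 with no delay), and the function names are hypothetical.

```python
import cmath
import math

R = 1 / math.sqrt(2)

# Assumed example: Haar analysis filters and matching synthesis filters.
def H0(z):
    return R * (1 + 1 / z)

def H1(z):
    return R * (1 - 1 / z)

def G0(z):
    return R * (z + 1)

def G1(z):
    return R * (1 - z)

H = [H0, H1]
G = [G0, G1]

def lhs_A5(z):
    """Left-hand side of (A.5); should equal 2."""
    return G0(z) * H0(z) + G1(z) * H1(z)

def lhs_A6(z):
    """Left-hand side of (A.6); should equal 0."""
    return G0(z) * H0(-z) + G1(z) * H1(-z)

def det_Hm(z):
    """Determinant of the analysis modulation matrix."""
    return H0(z) * H1(-z) - H0(-z) * H1(z)

def lhs_A21(i, j, z):
    """Left-hand side of (A.21); should equal 2 if i == j and 0 otherwise."""
    return G[i](z) * H[j](z) + G[i](-z) * H[j](-z)

points = [cmath.exp(0.7j), 0.5 + 0.2j, -1.3 + 2.0j]
for z in points:
    # Each printed value should be at rounding level.
    print(abs(lhs_A5(z) - 2), abs(lhs_A6(z)), abs(G1(z) + 2 * H0(-z) / det_Hm(z)))
```

The third printed quantity checks the closed form for the synthesis filter in terms of the modulation-matrix determinant, which for this pair is simply 2/z.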

Equations (A.5) and (A.6) can be manipulated to yield a general expression 
relating the constituent filters; this will be especially useful for interpreting the 
analysis-synthesis filter bank in terms of a time-domain signal expansion. The 
first step in the derivation, which basically mirrors the treatment given in [238], 
is to rewrite Equation (A.6) as

$$G_0(z) = \frac{-G_1(z)H_1(-z)}{H_0(-z)}. \tag{A.9}$$

Substituting this expression into Equation (A.5) and solving for $G_1(z)$ yields

$$G_1(z) = \frac{-2H_0(-z)}{H_0(z)H_1(-z) - H_0(-z)H_1(z)} = \frac{-2H_0(-z)}{\det H_m(z)}. \tag{A.10}$$

Similarly,

$$G_0(z) = \frac{2H_1(-z)}{\det H_m(z)}. \tag{A.11}$$

Then, it is simple to establish the relationships

$$G_0(z)H_0(z) = \frac{2H_0(z)H_1(-z)}{\det H_m(z)} \tag{A.12}$$
$$G_1(z)H_1(z) = \frac{-2H_0(-z)H_1(z)}{\det H_m(z)}. \tag{A.13}$$

Noting that $\det H_m(z) = -\det H_m(-z)$,

$$G_1(z)H_1(z) = \frac{2H_0(-z)H_1(z)}{\det H_m(-z)} = G_0(-z)H_0(-z). \tag{A.14}$$

Equation (A.5) can then be transformed into the condition 


$$G_0(z)H_0(z) + G_0(-z)H_0(-z) = 2 \tag{A.15}$$

or the analogous condition

$$G_1(z)H_1(z) + G_1(-z)H_1(-z) = 2. \tag{A.16}$$

Equation (A.6) can also be readily manipulated using the result of Equation
(A.14). Multiplying Equation (A.6) by $H_0(z)$ yields:

$$G_0(z)H_0(z)H_0(-z) + G_1(z)H_1(-z)H_0(z) = 0 \tag{A.17}$$
$$G_1(-z)H_1(-z)H_0(-z) + G_1(z)H_1(-z)H_0(z) = 0 \tag{A.18}$$
$$\Longrightarrow \quad G_1(z)H_0(z) + G_1(-z)H_0(-z) = 0, \tag{A.19}$$

where the last expression must hold at least where $H_1(-z)$ is nonzero; indeed,
no generality is actually lost here since the two-channel filter bank cannot
achieve perfect reconstruction if $H_0(z)$ and $H_1(z)$ have any common zeros [238].
Similarly,

$$G_0(z)H_1(z) + G_0(-z)H_1(-z) = 0. \tag{A.20}$$


The various z-transform relationships derived here for the critically sampled 
two-channel perfect reconstruction filter bank can be summarized in one equa- 
tion: 


\[ G_i(z)H_j(z) + G_i(-z)H_j(-z) = 2\delta[i - j]. \tag{A.21} \]


In the next section, this leads to an interpretation of the filter bank in terms 
of a biorthogonal basis. 


Perfect reconstruction and biorthogonality. By manipulating the per- 
fect reconstruction condition in (A.21), it can be shown that a perfect recon- 
struction filter bank derives a signal expansion in a biorthogonal basis; the basis 
is related to the impulse responses of the filter bank. This relationship is of 
interest in that it establishes a connection between the filter bank model and 
the atomic model that underlie the discrete wavelet transform. 

The time-domain relationship corresponding to Equation (A.21) can be de- 
rived using two z-transform properties: convolution and modulation. Letting 


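\[ g[n] \longleftrightarrow G(z) \quad \text{and} \quad h[n] \longleftrightarrow H(z), \tag{A.22} \]

the properties are as follows:

\[
\begin{aligned}
\text{Convolution:} &\quad \sum_k g[k]\,h[n-k] \longleftrightarrow G(z)H(z) \\
\text{Modulation:} &\quad (-1)^n g[n] \longleftrightarrow G(-z).
\end{aligned}
\tag{A.23}
\]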


Using these properties to express Equation (A.21) in the time domain yields: 


\begin{align}
\sum_k g_i[k]\,h_j[m-k] + (-1)^m \sum_k g_i[k]\,h_j[m-k] &= 2\delta[m]\,\delta[i-j] \tag{A.24}\\
\sum_k g_i[k]\,h_j[m-k]\,\bigl[1 + (-1)^m\bigr] &= 2\delta[m]\,\delta[i-j]. \tag{A.25}
\end{align}

For odd $m$, the last expression simplifies trivially to $0 = 0$. For even $m$, replaced
here by $2n$,

\[ \sum_k g_i[k]\,h_j[2n-k] = \delta[n]\,\delta[i-j]. \tag{A.26} \]

Equivalently, the relationship can be formulated as

\[ \sum_k h_i[k]\,g_j[2n-k] = \delta[n]\,\delta[i-j] \tag{A.27} \]


by interchanging the filters in the convolution expression. In inner product 
notation, Equations (A.26) and (A.27) can be written as 


\begin{align}
\langle g_i[k],\, h_j[2n-k] \rangle &= \delta[n]\,\delta[i-j] \tag{A.28}\\
\langle h_i[k],\, g_j[2n-k] \rangle &= \delta[n]\,\delta[i-j], \tag{A.29}
\end{align}


respectively. The above expressions show that the impulse responses of the 
filters and their shifts by two, with one of the impulse responses time-reversed as 
indicated, constitute a biorthogonal basis for discrete-time signals (with finite 
energy), namely the space $\ell^2(\mathbb{Z})$. Note that real filters have been implicitly
assumed; for complex filters, the first terms in the inner product expressions 
would be conjugated. Also note that the analysis and synthesis filter banks are 
mathematically interchangeable; this symmetry is analogous to the equivalence 
of left and right matrix inverses discussed in Section 1.4.1. 

The preceding derivation indicates that perfect reconstruction and biorthog- 
onality are equivalent conditions; in the next section, this insight is used to 
relate filter banks and signal expansions. 
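
The biorthogonality relation in Equation (A.26) can be checked numerically. The
following is a minimal sketch, assuming the Haar filters as an example pair (they
are not taken from the text) and storing each finite sequence as a dictionary
from time index to sample so that noncausal synthesis filters can be represented;
the helper names `pairing` and `biorthogonality_table` are hypothetical.

```python
import numpy as np

S = 1.0 / np.sqrt(2.0)

# Sequences are stored as {time index: sample}; missing indices are zero.
# Haar analysis filters h_0, h_1 (an assumed example, not from the text).
h = [{0: S, 1: S}, {0: S, 1: -S}]
# Synthesis filters chosen as time reversals, g_i[n] = h_i[-n].
g = [{-1: S, 0: S}, {-1: -S, 0: S}]


def pairing(gi, hj, n):
    """Return sum_k g_i[k] * h_j[2n - k], the left side of Equation (A.26)."""
    return sum(val * hj.get(2 * n - k, 0.0) for k, val in gi.items())


def biorthogonality_table(g, h, shifts=range(-3, 4)):
    """Evaluate the pairing for every filter pair (i, j) and every shift n."""
    return {(i, j, n): pairing(g[i], h[j], n)
            for i in range(2) for j in range(2) for n in shifts}


table = biorthogonality_table(g, h)
for (i, j, n), value in sorted(table.items()):
    if abs(value) > 1e-12:
        print(f"i={i} j={j} n={n}: {value:.3f}")
```

If the relation holds, the only entries that survive the threshold are those with
$i = j$ and $n = 0$.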


Interpretation as a signal expansion in a biorthogonal basis. Given 
that the impulse responses in a two-channel perfect reconstruction filter bank 
are related to an underlying biorthogonal basis, it is reasonable to consider 
the time-domain signal expansion carried out by such a filter bank. Using the 
notation of Figure 3.3, the channel signals are given by convolution followed by 
downsampling: 


\begin{align}
y_0[n] &= \sum_m x[m]\,h_0[2n-m] = \langle x[m],\, h_0[2n-m] \rangle \tag{A.30}\\
y_1[n] &= \sum_m x[m]\,h_1[2n-m] = \langle x[m],\, h_1[2n-m] \rangle. \tag{A.31}
\end{align}


Upsampling followed by convolution gives the outputs of the synthesis filters, 
which can be thought of as full-rate subband signals: 


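\begin{align}
\hat{x}_0[n] &= \sum_{m\ \mathrm{even}} y_0[m/2]\, g_0[n-m] \tag{A.32}\\
&= \sum_k y_0[k]\, g_0[n-2k] = \langle y_0[k],\, g_0[n-2k] \rangle \tag{A.33}\\
\hat{x}_1[n] &= \sum_k y_1[k]\, g_1[n-2k] = \langle y_1[k],\, g_1[n-2k] \rangle. \tag{A.34}
\end{align}
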
The reconstructed output is thus given by

\begin{align}
\hat{x}[n] &= \hat{x}_0[n] + \hat{x}_1[n] \tag{A.35}\\
&= \sum_k y_0[k]\, g_0[n-2k] + \sum_k y_1[k]\, g_1[n-2k] \tag{A.36}\\
&= \sum_k \langle x[m],\, h_0[2k-m] \rangle\, g_0[n-2k]
 + \sum_k \langle x[m],\, h_1[2k-m] \rangle\, g_1[n-2k] \tag{A.37}\\
&= \sum_{i=0}^{1} \sum_k \langle x[m],\, h_i[2k-m] \rangle\, g_i[n-2k]. \tag{A.38}
\end{align}


Introducing the notation 
\[ g_{i,k}[n] = g_i[n-2k] \quad \text{and} \quad \alpha_{i,k} = \langle x[m],\, h_i[2k-m] \rangle, \tag{A.39} \]

the signal reconstruction can be clearly expressed as an atomic model:

\[ x[n] = \sum_{i,k} \alpha_{i,k}\, g_{i,k}[n]. \tag{A.40} \]


The coefficients in the atomic decomposition are derived by the analysis filter 
bank, and the expansion functions are time-shifts of the impulse responses of the 
synthesis filter bank. As noted earlier, the filter banks are interchangeable; the 
signal could also be written as an atomic decomposition based on the impulse 
responses $h_{i,k}[n]$. In either case, the atoms in the signal model correspond to
the synthesis filter bank. 
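
The atomic interpretation can be sketched in code: the analysis bank computes
coefficients as inner products with shifted, time-reversed analysis filters, and
the output is rebuilt by accumulating weighted, shifted synthesis responses. This
is a minimal sketch under stated assumptions, not the implementation used in the
text: the Haar filters, the dictionary storage of sequences, and the function
names `analysis_coefficients` and `atomic_synthesis` are all illustrative choices.

```python
import numpy as np

S = 1.0 / np.sqrt(2.0)
# Filters stored as {time index: sample}; the Haar pair is an assumed example.
H = [{0: S, 1: S}, {0: S, 1: -S}]      # analysis filters h_0, h_1
G = [{-1: S, 0: S}, {-1: -S, 0: S}]    # synthesis filters g_i[n] = h_i[-n]


def analysis_coefficients(x, h):
    """alpha[k] = <x[m], h[2k - m]> for a signal supported on 0..len(x)-1.

    The range of k is sized for two-tap filters supported on indices 0 and 1.
    """
    n_samples = len(x)
    return {k: sum(x[m] * h.get(2 * k - m, 0.0) for m in range(n_samples))
            for k in range(n_samples // 2 + 1)}


def atomic_synthesis(alphas, g_list, n_samples):
    """Accumulate alpha_{i,k} * g_i[n - 2k] over both channels and all shifts."""
    x_hat = np.zeros(n_samples)
    for alpha, g in zip(alphas, g_list):
        for k, coeff in alpha.items():
            for offset, g_val in g.items():
                n = 2 * k + offset      # g_i[n - 2k] is nonzero at n = 2k + offset
                if 0 <= n < n_samples:
                    x_hat[n] += coeff * g_val
    return x_hat


rng = np.random.default_rng(0)
x = rng.standard_normal(8)
alphas = [analysis_coefficients(x, h) for h in H]
x_hat = atomic_synthesis(alphas, G, len(x))
print(np.max(np.abs(x - x_hat)))   # reconstruction error
```

Each inner loop iteration adds one weighted atom, so the sum over both channels
and all shifts mirrors the double sum over $i$ and $k$ in the expansion.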

In this appendix, it has been shown that a critically sampled two-channel 
perfect reconstruction filter bank computes a signal expansion in a biorthogonal 
basis. Multiresolution decompositions such as the discrete wavelet transform 
and wavelet packets can be developed by iterating these two-channel structures. 
Here, it should simply be noted that the development in Equations (A.35) 
through (A.38) provides a connection between the interpretations of the wavelet 
transform as a filter bank model and as an atomic model; a subband signal is 
derived as an accumulation of weighted atoms corresponding to the impulse 
responses of the synthesis filter for that band. Such issues are discussed at 
greater length in Section 3.2.1. 


Appendix B 
Fourier Series Representations 


In Chapter 5, the Fourier series is applied to a pitch-synchronous signal rep- 
resentation to arrive at a pitch-synchronous sinusoidal model. The details of 
Fourier series methods are reviewed here. 


Complex Fourier series and the discrete Fourier transform. The Fourier
basis for $\mathbb{C}^N$ is the set of harmonically related complex sinusoids:

\[ \left\{ e^{j\omega_k n} \;:\; \omega_k = \frac{2\pi k}{N} \ \text{for}\ k = 0, 1, 2, \ldots, N-1 \right\}. \tag{B.1} \]


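The complex Fourier series expansion for a signal $x[n] \in \mathbb{C}^N$ is then

\[ x[n] = \sum_{k=0}^{N-1} c_k\, e^{j2\pi kn/N}, \tag{B.2} \]

where the coefficients $c_k$ are given by the formulation:

\begin{align}
\sum_{n=0}^{N-1} x[n]\, e^{-j2\pi ln/N} &= \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} c_k\, e^{j2\pi (k-l)n/N} \tag{B.3}\\
&= \sum_{k=0}^{N-1} c_k\, N\, \delta[k-l] = N c_l \tag{B.4}\\
\Longrightarrow \quad c_k &= \frac{1}{N} \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/N}. \tag{B.5}
\end{align}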




This expression for $c_k$ is closely related to the discrete Fourier transform, which
is given by the analysis and synthesis equations

\begin{align}
X[k] &= \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/N} && \text{Analysis} \tag{B.6}\\
x[n] &= \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j2\pi kn/N} && \text{Synthesis,} \tag{B.7}
\end{align}


where the analysis equation derives the DFT expansion coefficients or spectrum 
and the synthesis equation reconstructs the signal from those coefficients. Given 
the existence of fast algorithms for computing the DFT (i.e. the FFT), it is 
useful to note the simple relationship of the Fourier series coefficients and the 
DFT expansion: 


\[ c_k = \frac{X[k]}{N}. \tag{B.8} \]
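
The relationship between the Fourier series coefficients and the DFT can be
checked numerically. The sketch below assumes NumPy, whose `np.fft.fft` uses the
same negative-exponent analysis convention as Equation (B.6); the function names
are hypothetical and chosen only for this illustration.

```python
import numpy as np


def fourier_series_coefficients(x):
    """Fourier series coefficients obtained from the FFT: c_k = X[k] / N."""
    x = np.asarray(x)
    return np.fft.fft(x) / len(x)


def direct_coefficients(x):
    """Evaluate Equation (B.5) directly, without a fast algorithm."""
    x = np.asarray(x)
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) / N
                     for k in range(N)])


def series_synthesis(c):
    """Evaluate Equation (B.2): x[n] = sum_k c_k exp(j 2 pi k n / N)."""
    c = np.asarray(c)
    N = len(c)
    k = np.arange(N)
    return np.array([np.sum(c * np.exp(2j * np.pi * k * n / N))
                     for n in range(N)])


rng = np.random.default_rng(1)
x = rng.standard_normal(6) + 1j * rng.standard_normal(6)
c = fourier_series_coefficients(x)
print(np.allclose(c, direct_coefficients(x)), np.allclose(series_synthesis(c), x))
```

The direct evaluation costs on the order of $N^2$ operations, which is why the
FFT route is the practical one for long signals.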


Real expansions of real signals. The Fourier expansion coefficients and 
the DFT spectrum are complex-valued even for real signals. For real signals, a 
real-valued expansion of the form 


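\[ x[n] = \sum_{k=0}^{N-1} a_k \cos \omega_k n + b_k \sin \omega_k n \tag{B.9} \]

can be derived using Euler’s equation:

\[ e^{j\Theta} = \cos\Theta + j\sin\Theta. \tag{B.10} \]

For real $x[n]$,

\[ x[n] = \mathrm{Re}\{x[n]\} = \frac{x[n] + x[n]^*}{2}. \tag{B.11} \]

Rewriting this using the complex Fourier expansion gives:

\begin{align}
x[n] &= \frac{1}{2} \sum_{k=0}^{N-1} c_k\, e^{j\omega_k n} + c_k^*\, e^{-j\omega_k n} \tag{B.12}\\
&= \frac{1}{2} \sum_{k=0}^{N-1} c_k \left( \cos\omega_k n + j\sin\omega_k n \right)
 + c_k^* \left( \cos\omega_k n - j\sin\omega_k n \right) \tag{B.13}\\
&= \sum_{k=0}^{N-1} \left( \frac{c_k + c_k^*}{2} \right) \cos\omega_k n
 + j \left( \frac{c_k - c_k^*}{2} \right) \sin\omega_k n. \tag{B.14}
\end{align}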
The expansion coefficients in the Fourier cosine and sine series are thus given
by:

\[
\begin{aligned}
a_k &= \frac{c_k + c_k^*}{2} = \mathrm{Re}\{c_k\} = \frac{\mathrm{Re}\{X[k]\}}{N} \\
b_k &= j \left( \frac{c_k - c_k^*}{2} \right) = -\mathrm{Im}\{c_k\} = -\frac{\mathrm{Im}\{X[k]\}}{N}.
\end{aligned}
\tag{B.15}
\]

Furthermore, the spectrum of a real signal is conjugate-symmetric:

\[ X[k] = X[N-k]^*, \tag{B.16} \]

which can be expressed in terms of the real and imaginary parts as

\[
\begin{aligned}
\mathrm{Re}\{X[k]\} &= \mathrm{Re}\{X[N-k]\} &&\Longrightarrow\quad a_k = a_{N-k} \\
\mathrm{Im}\{X[k]\} &= -\mathrm{Im}\{X[N-k]\} &&\Longrightarrow\quad b_k = -b_{N-k}.
\end{aligned}
\tag{B.17}
\]


This underlying symmetry can be used to halve the number of coefficients 
needed to represent z[n]. For odd N, the simplification is: 


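\begin{align}
x[n] &= \sum_{k=0}^{N-1} a_k \cos\omega_k n + b_k \sin\omega_k n \tag{B.18}\\
&= a_0 + \sum_{k=1}^{\frac{N-1}{2}} a_k \cos\!\left( \frac{2\pi kn}{N} \right)
 + a_{N-k} \cos\!\left( \frac{2\pi (N-k)n}{N} \right) \tag{B.19}\\
&\qquad\qquad + b_k \sin\!\left( \frac{2\pi kn}{N} \right)
 + b_{N-k} \sin\!\left( \frac{2\pi (N-k)n}{N} \right) \notag\\
&= a_0 + \sum_{k=1}^{\frac{N-1}{2}} (a_k + a_{N-k}) \cos\!\left( \frac{2\pi kn}{N} \right)
 + (b_k - b_{N-k}) \sin\!\left( \frac{2\pi kn}{N} \right) \tag{B.20}\\
&= a_0 + 2 \sum_{k=1}^{\frac{N-1}{2}} a_k \cos\!\left( \frac{2\pi kn}{N} \right)
 + b_k \sin\!\left( \frac{2\pi kn}{N} \right). \tag{B.21}
\end{align}

For even $N$, the result is:

\[ x[n] = a_0 + a_{N/2} \cos \pi n + 2 \sum_{k=1}^{\frac{N}{2}-1} a_k \cos\!\left( \frac{2\pi kn}{N} \right)
 + b_k \sin\!\left( \frac{2\pi kn}{N} \right). \tag{B.22} \]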


In either case, the $a_0$ term corresponds to the average value of the signal. For
even $N$, the $a_{N/2}$ term corresponds to the Nyquist frequency, at which the
spectrum should have zero amplitude; for the remainder, it is assumed that
$a_{N/2} = 0$.
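
The half-spectrum expansions for odd and even lengths can be sketched as follows.
This is an illustrative sketch assuming NumPy; the function names are hypothetical,
and the Nyquist term is kept for even lengths so that the check does not depend on
the zero-amplitude assumption stated above.

```python
import numpy as np


def cosine_sine_coefficients(x):
    """Return (a_k, b_k) with a_k = Re{X[k]}/N and b_k = -Im{X[k]}/N."""
    X = np.fft.fft(np.asarray(x, dtype=float))
    N = len(X)
    return X.real / N, -X.imag / N


def half_spectrum_synthesis(a, b):
    """Rebuild a real signal from a_0, the Nyquist term, and the half spectrum."""
    N = len(a)
    n = np.arange(N)
    x = np.full(N, a[0], dtype=float)
    # (N - 1) // 2 equals (N - 1)/2 for odd N and N/2 - 1 for even N.
    for k in range(1, (N - 1) // 2 + 1):
        w = 2.0 * np.pi * k / N
        x += 2.0 * (a[k] * np.cos(w * n) + b[k] * np.sin(w * n))
    if N % 2 == 0:
        x += a[N // 2] * np.cos(np.pi * n)   # Nyquist term, even N only
    return x


rng = np.random.default_rng(2)
for N in (7, 8):
    x = rng.standard_normal(N)
    a, b = cosine_sine_coefficients(x)
    print(N, np.allclose(half_spectrum_synthesis(a, b), x))
```

Only about half of the coefficient pairs are read by the synthesis loop, which is
the saving that the conjugate symmetry provides.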
Magnitude-phase representations. The complex spectrum is often expressed
in terms of its magnitude and phase:

\[ X[k] = \mathrm{Re}\{X[k]\} + j\,\mathrm{Im}\{X[k]\} = |X[k]|\, e^{j\phi_k}, \tag{B.23} \]

where

\begin{align}
|X[k]| &= \sqrt{\mathrm{Re}\{X[k]\}^2 + \mathrm{Im}\{X[k]\}^2} \tag{B.24}\\
\phi_k &= \arctan\!\left( \frac{\mathrm{Im}\{X[k]\}}{\mathrm{Re}\{X[k]\}} \right). \tag{B.25}
\end{align}

The magnitude-phase representation is often of interest in audio applications 
because the ear is relatively insensitive to phase. With this as motivation, 
the sine-cosine expansion of real signals discussed above can be rewritten in 
magnitude-phase form based on the following derivation: 


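\begin{align}
a\cos\Theta + b\sin\Theta
&= a \left( \frac{e^{j\Theta} + e^{-j\Theta}}{2} \right)
 + b \left( \frac{e^{j\Theta} - e^{-j\Theta}}{2j} \right) \tag{B.26}\\
&= e^{j\Theta} \left( \frac{a - jb}{2} \right) + e^{-j\Theta} \left( \frac{a + jb}{2} \right) \tag{B.27}\\
&= \frac{e^{j\Theta}}{2} \sqrt{a^2 + b^2}\; e^{-j\arctan\frac{b}{a}}
 + \frac{e^{-j\Theta}}{2} \sqrt{a^2 + b^2}\; e^{j\arctan\frac{b}{a}} \tag{B.28}\\
&= \sqrt{a^2 + b^2}\, \cos\!\left( \Theta - \arctan\frac{b}{a} \right). \tag{B.29}
\end{align}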


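Substituting $\omega_k n$ for $\Theta$, where $\omega_k = 2\pi k/N$, and incorporating a
summation over $k$ yields another form for the sums in Equations (B.21) and (B.22):

\begin{align}
x[n] &= a_0 + 2 \sum_k \sqrt{a_k^2 + b_k^2}\, \cos\!\left( \omega_k n - \arctan\frac{b_k}{a_k} \right) \tag{B.30}\\
&= \frac{X[0]}{N} + \frac{2}{N} \sum_k |X[k]| \cos\left( \omega_k n + \phi_k \right), \tag{B.31}
\end{align}

where $|X[k]|$ and $\phi_k$ are as defined in Equations (B.24) and (B.25) and $k$
ranges over the half spectrum. As a check, note that:

\begin{align}
X[k] &= N (a_k - j b_k) = N a_k - j N b_k \tag{B.32}\\
&= \mathrm{Re}\{X[k]\} + j\,\mathrm{Im}\{X[k]\} \tag{B.33}\\
\Longrightarrow \quad \phi_k &= -\arctan\frac{b_k}{a_k} = \arctan\!\left( \frac{\mathrm{Im}\{X[k]\}}{\mathrm{Re}\{X[k]\}} \right). \tag{B.34}
\end{align}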


This magnitude-phase form is suggestive of the sinusoidal model of Chapter 
2. The connection is discussed in Section 5.3, where it is shown that some of 
the difficulties in sinusoidal modeling can be overcome by applying the Fourier 
series in a pitch-synchronous manner. 
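
The magnitude-phase form of the expansion can be sketched numerically. This is
a minimal illustration assuming NumPy and an odd signal length, so that no Nyquist
term arises; the function name is hypothetical. One deliberate substitution is
that the phase is computed with `np.angle`, a four-quadrant arctangent, rather
than the plain arctangent of the ratio, so that spectral bins with a negative
real part are handled correctly.

```python
import numpy as np


def magnitude_phase_synthesis(x):
    """Rebuild a real signal from |X[k]| and phi_k over the half spectrum."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)
    magnitude = np.abs(X)
    phase = np.angle(X)                 # four-quadrant arctangent of Im/Re
    n = np.arange(N)
    out = np.full(N, X[0].real / N)     # average value, X[0] / N
    for k in range(1, (N - 1) // 2 + 1):
        w = 2.0 * np.pi * k / N
        out += (2.0 / N) * magnitude[k] * np.cos(w * n + phase[k])
    return out


rng = np.random.default_rng(3)
x = rng.standard_normal(9)              # odd length: no Nyquist bin
print(np.allclose(magnitude_phase_synthesis(x), x))
```

For even lengths this sketch omits the Nyquist term, in line with the assumption
that its amplitude is zero.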
adaptive segmentation, 88, 100-114
see also dynamic segmentation 
adaptive wavelet packets, see wavelet 
packets 
additive models, 3-4 
additive synthesis, 4, 64 
aliasing cancellation, 48, 91, 93, 96-98, 
125, 160 
time-domain, 35, 44-47 
amplitude envelope, 33, 37-41, 48-49, 58, 
66-67, 75-82, 117, 150-151, 
183, 197, 199 
distortion, 74-79 
amplitude interpolation, 66, 73-74, 150- 
151 
see also parameter interpolation 
analysis, 1, 8 
analysis window, 31 
analysis-by-synthesis, 62-63, 95, 110, 117, 
184 
analysis-synthesis, 1-6, 115-116 
artifacts, see reconstruction artifacts 
asymmetric windows, 113-114 
atomic models, 21, 26-28, 85-94, 152- 
153, 157, 167-200, 203-207
relation to filter banks, 88, 93-94, 
191, 211-213 
time reference, 180 
see also granular analysis-synthesis 
atoms 
composite, 183-187, 189, 192-193, 
195-200 
damped sinusoids, 27, 167, 181-196, 
198, 200, 203 
Gabor, 21, 26-27, 167, 180-182, 
185-188, 200, 203 
sinusoidal, 85-88, 109, 203, 206 
time-frequency, 13, 21 
see also grains 
attack problem, 5, 81, 117 


see also pre-echo 
attacks, 5, 65, 79-82, 85, 87-88, 109-112, 
116-118, 145, 150, 164, 167 
audio 
broadcasting, 66, 204 
coding, 2, 5, 13, 46, 47, 88, 100, 
107, 111, 113-114, 118, 151- 
152, 161-162, 164, 203-205 
recording, 204 
transmission, 204 
auditory filter banks, 93, 96, 118-119 
auditory scene analysis, 165 
autocorrelation, 122 


backward methods, 177 
backward orthogonal matching pursuit, 
175-177 
Balian-Low theorem, 47 
bandlimited interpolation, 35 
basis expansions, 4, 9-15, 23, 24, 48, 52, 
168, 177 
biorthogonal, 10-11, 15, 16, 20-21, 
89-91, 168, 211-213 
orthogonal, 10-11, 20, 61, 134, 168, 
177 
orthonormal, 11 
shortcomings, 13-14, 19, 168 
bassoon, 84, 145, 146, 153, 154, 158 
best basis, 13, 15-16, 24, 59, 169 
bilinear expansions, 25 
bin frequency, 52 
biorthogonal basis, 10-11, 15, 16, 20-21, 
89-91, 168, 211-213 
wavelets, 11, 12, 91, 211-213 
birth, 65, 80, 148 
Blackman-Harris windows, 35, 69, 75 
block transforms, 12, 46-47, 108 
bow noise, 81 
breath noise, 30, 81, 116, 132, 138, 202 
brute force, 196 
cells, 102

center clipping, 7 

channel vocoder, 48-49 

characterization, 8 

chirps, 25, 41-44, 79 

circular convolution, 53, 69 

clarinet, 84 

codebook, 206 

coding gain, 161-162, 204, 205 

comb wavelets, 157 

compact models, 6-8, 13-28, 30, 42, 51, 
59-61, 79, 84, 114, 116, 124, 
147, 149, 153, 154, 158, 161- 
171, 178, 179, 200-207 

compaction, see compact models, com- 
pression, data reduction 

completeness, 14 

complexity, 170, 173, 178, 207 

composite atoms, 183-187, 189, 192-193, 
195-200 

compression, 5-8, 12, 14, 28, 50, 76, 78, 
92, 108, 111, 130, 151, 160, 
165, 168-170, 180, 201, 205 

see also audio coding, compaction, 

image coding 

computation-rate-distortion, 178-179, 186, 
206 

computational cost, 15, 32, 63, 71, 82, 98, 
101-113, 144, 170, 173, 178, 
183-187, 190, 195-199, 202 

computer music, 4, 22, 26, 64, 84, 160 

conjugate pursuit, 174-175, 193-195 

conjugate subspaces, 174-175, 193-195 

consistency, 48 

constant-Q filter banks, 96, 118 

continuous-time wavelet transform, 23, 88 

correlation update, 173, 183, 187, 195-199 

cosine-modulated filter banks, 42, 47, 203 

critical bands, 118-119 

critically sampled filter banks, 12, 16, 23, 
46-48, 51, 89-98, 125, 209-213 

cross terms, 25-26, 124, 207 

cross-synthesis, 8, 30, 81, 83-84, 147 

CWT, see continuous-time wavelet trans- 
form 


damped sinusoids, 27, 167, 180-196, 198, 
200, 203 
data reduction, 5, 8, 32, 43, 47, 50-51, 80, 
139, 151-152, 165, 201-206 
see also compaction, compression 
Daubechies wavelets, 93, 94, 153 
death, 65, 80, 148 
decompositions, 3 
atomic, see atomic models 
deterministic-plus-stochastic, 30-31, 
115, 135, 142-143, 162-164 


lowpass-plus-details, 13, 139, 153, 
161 
mixtures, see mixed models 
multiresolution, 22, 88-95, 152, 206, 
213 
octave-band, 13, 92-93, 98-99, 152- 
154 
periodic-plus-details, 157-162 
reconstruction-plus-residual, 30-31, 
115, 135, 162 
signal-plus-residual, 5, 30-31, 115, 
135 
time-frequency, 21-27
voiced-unvoiced, 115, 142-143 
deletion, 146 
denoising, 6-7, 14, 16, 81-82, 170, 201 
dereferenced modulation, 180, 182-185, 
190, 192-193 
detection, 8, 182 
deterministic-plus-stochastic models, 30- 
31, 115, 135, 142-143, 162-164 
see also noiselike components 
DFT, see discrete Fourier transform 
dictionaries, 13 
design, 171, 186, 188, 206 
Fourier, 59-63 
hybrid, 183 
overcomplete, 13, 16-20, 27, 28, 59, 
167-180, 203, 205 
random, 179 
time-frequency, 13, 27, 167, 173, 
179-186, 203 
see also atoms 
undercomplete, 172 
discrete cosine transform, 2 
two-dimensional, 165 
discrete Fourier transform, see Fourier 
transform 
discrete wavelet transform, see wavelet 
transform 
discrete-time Fourier transform, 45, 67- 
70, 134-135, 148 
downsampling, 39, 91, 93-95, 144, 212 
DTFT, see discrete-time Fourier trans- 
form 
dual basis, 10, 20, 60, 168 
dual frame, 16, 20, 59 
DWT, see discrete wavelet transform 
dynamic programming, 63, 100-108, 111, 
114, 202, 204 
nodes, 104 
truncation depth, 102 
dynamic segmentation, 88, 101-110, 117, 
118, 124, 202, 204 


electrocardiogram signals, 6, 160, 165-166 
enhancement, 7, 81-82, 207 
entropy, 6, 16, 169 


entropy coding, 205 

equivalent rectangular bands, 118-125, 
129-132 

ERB, see equivalent rectangular bands 

evolution, 41, 44, 63, 64, 73, 74, 84, 86, 
94, 204, 206 

evolutionary models, 86-87, 93-94, 206 

excess bandwidth, 128 

expansions, 3 

see also basis expansions, overcom- 

plete expansions 


family, 23 
fast Fourier transform, 68, 72, 74, 103, 
130, 136, 187, 192, 216 
FBS, see filter bank summation 
FFT, see fast Fourier transform 
Fibonacci series, 103 
filter bank summation, 37, 39, 44, 45 
filter banks, 2, 28 
auditory, 93, 96, 118-119 
constant-Q, 96, 118 
cosine-modulated, 42, 47, 203 
critically sampled, 12, 16, 23, 46-48, 
51, 89-98, 125, 209-213 
design, 125-129 
ripple, 127-129 
transition regions, 127-129 
heterodyne, 36-39, 41, 193 
iterated, 12, 91-93, 153 
modulated, 38-41, 44, 51, 54, 84, 
132, 193 
nonuniform, 125-129 
octave-band, 12, 96-100, 125, 152- 
154 
oversampled, 16, 47, 48, 95-98, 125 
perfect reconstruction, 4, 12, 16, 41, 
47, 89-91, 98, 125-126, 152- 
155, 209-213 
recursive, 28, 167, 182, 187-200, 203 
signal-adaptive, 24, 44, 100, 114, 164 
tree-structured, 12, 16, 91-93, 97, 
153 
two-channel, 12, 89-93, 209-213 
wavelets, 11, 88-98, 164, 209-213 
fixed segmentation, 108-112, 118, 202 
flute, 30, 81, 116, 132, 138, 202 
formant-corrected pitch-shifting, 83, 146 
forward methods, 177 
forward orthogonal matching pursuit, 
177-178 
forward segmentation, 110-113 
Fourier basis, 11, 149, 168, 180, 215 
Fourier dictionaries, 59-63 
Fourier series, 30, 147-148, 152, 215-218 
Fourier transform, 14, 24, 87 
discrete, 11, 31, 52-62, 67-73, 129-
138, 144, 147-148, 191-192, 


215-218 

discrete-time, 45, 67-70, 134-135, 
148 

fast, 68, 72, 74, 103, 130, 136, 187, 
192, 216 

inverse, see inverse DFT, inverse 
FFT 

oversampled, 35, 51-62, 68, 71, 148, 
204 

pitch-synchronous, 148, 165, 203, 
218 

short-time, see short-time Fourier 
transform 

tiling, 25 


frame bounds, 15 
frames, 15-16, 169 
see also tight frames 
frames of complex sinusoids, 60-62 
frequency, 21 
see also localization, resolution 


Gabor, 21, 26-27, 32, 167, 180 

Gabor atoms, 21, 26-27, 62, 167, 180-182,
185-188, 200, 203 

Gaussian window, 32, 54, 185, 186 

gender modification, 83 

geometric interpretations, 19-21, 79, 172, 
173 

global optimization, 63, 101-108, 110, 
111, 117 

glottis, 141 

grains, 26-28, 43, 87, 146, 147, 170, 203 

see also atoms 

Gram-Schmidt orthogonalization, 176- 
178 

granular analysis-synthesis, 9, 26-27, 85-
87, 152, 157, 160-161, 166, 200 

granular synthesis, 4, 9, 26-27, 146 

granulation, 26, 27, 43, 86, 146-147, 160, 
203 

greedy algorithms, 20, 63, 65, 111, 171- 
173, 178-179, 199-200 

guitar, 3, 8, 83, 84 


Haar wavelets, 17-19, 155, 157 

Hamming window, 35, 75 

Hanning window, 35, 41-43, 53-58, 69, 
70, 75, 185-187 

harmonic structure, 8, 30, 43-44, 56-65, 
139, 140, 148-162, 192, 205 

Heisenberg, 22 

heterodyne filter banks, 36-39, 41, 193 

heuristic segmentation, 110-113 

hidden Markov models, 65, 101 

high-resolution matching pursuit, 181 

hybrid dictionaries, 183 
hybrid window, 74-75, 113, 114, 137


identification, 8 

IDFT, see inverse DFT 

IFFT, see inverse FFT 

IIR wavelets, 182

image coding, 2, 5, 12, 13, 46, 95, 107, 
111, 153, 161, 165, 205 

image morphing, 84 

interpolation filters, 46 

inverse DFT, 34, 44, 67, 69, 71-76, 113- 
114, 117, 125, 129-137 

inverse FFT, 72, 74, 136 

inverse problems, 27, 168-171 

iterated filter banks, 12, 91-93, 153 


lapped orthogonal transforms, 2, 12, 47, 
108 
line tracking, 64-66, 75-78, 80, 109, 139, 
147-148, 205 
linear predictive coding, 9, 101, 102, 107, 
108, 115, 123, 142, 204 
local optimization, 110-111 
see also greedy algorithms 
localization, 12, 22-23, 31, 47, 53, 87-93, 
100, 108-114, 156, 167, 180 
frequency, 13, 21-22, 86, 93, 109, 
158 
scale, 21 
time, 12, 13, 21-22, 43, 56, 80-81, 
85-86, 93, 107-109, 162, 182 
see also resolution 
lowpass-plus-details models, 13, 139, 153, 
161 


magnitude-only reconstruction, 48, 67, 
76, 78, 151 
magnitude-phase representations, 73, 147-
148, 218 
marimba, 30, 116 
markers, 104 
masking, 5, 6, 8, 116, 118, 125, 205 
matched filters, 191-192 
matching pursuit, 20, 27, 28, 59, 63, 167- 
200, 203-206 
backward orthogonal, 175-177 
conjugate pursuit, 174-175, 193-195 
correlation update, 173, 183, 187, 
195-199 
forward orthogonal, 177-178 
high-resolution, 181 
orthogonal, 175-179 
readmission, 175-178 
subspace pursuit, 173-178, 194-195 
mathematical models, 3-4 
matrix inversion lemma, 176, 177 
mean-square error, 3, 16, 63, 102, 109, 
172, 186, 188, 189 


mixed models, 115-117, 138, 142-143 
see also decompositions 
modification, 8-9, 14, 20-21, 23-30, 43, 
47-49, 62, 64, 66, 76, 81-84, 
97, 117, 124, 144-147, 151- 
152, 160-166, 201, 205-207 
see also cross-synthesis, pitch-shifting, 
time-scaling 
modulated filter banks, 38-41, 44, 51, 54, 
84, 132, 193 
modulation matrix, 210 
morphing, 84 
mother function, 23 
motif, 69-72, 79, 113-114, 137 
multidimensional scaling, 84 
multiplexed wavelet transform, 157-159, 
161 
multiresolution, 22-23, 28 
decompositions, 22, 88-95, 152, 206, 
213 
phase vocoder, 96, 97 
segmentation, 85 
sinusoidal modeling, 80, 84-114, 
116-118, 124, 138, 167, 170, 
202-205 
see also pyramids, wavelets 
multiresolution segmentation, 88, 100 


near-perfect reconstruction, 4-6, 47, 61,
202 

nodes, 104 

noise perception, 117-126, 130, 134 

noiselike components, 30, 79, 85, 116, 124, 
138, 142, 162 

non-rectangular tiles, 22 

nonlinearity, 2, 7, 17, 26, 28, 59, 65, 77, 
95, 134, 169, 171, 207 

nonparametric models, 4, 8-21, 27, 28, 41, 
63, 81, 82, 84, 203 

nonuniform filter banks, 125-129 

nonuniform sampling, 68 

normalization, 130, 131, 135-138 

null space, 15, 21, 62, 168, 207 

Nyquist criterion, 35, 45, 126, 144, 217 


octave-band decompositions, 13, 92-93, 
98-99, 152-154 
octave-band filter banks, 12, 96-100, 125, 
152-154 
OLA, see overlap-add 
one-norm, 6 
optimality, 99, 101, 107-111, 160, 170, 
173, 179, 199, 202, 204, 205, 
207 
optimization, 99, 102, 172, 185, 204-206 
global, 63, 101-108, 111, 117 
local, 110-111 


metric, 3, 15, 16, 24, 101-111, 169- 
179, 185, 194 
orthogonal basis, 10-11, 20, 61, 134, 168, 
177 
orthogonal matching pursuit, 175-179 
orthogonality of components, 19-21, 60, 
62, 79 
see also readmission 
orthogonality principle, 172, 173, 175 
orthonormal basis, 11 
oscillator banks, 48-50, 63-67, 75, 148- 
149 
overcomplete dictionaries, 13, 16-20, 27,
28, 59, 167-180, 203, 205 
overcomplete expansions, 9, 13-28, 59-62, 
167-171, 182, 199, 204, 206- 
207 
computation, 15, 19, 20, 59, 61, 169- 
171, 207 
overcompleteness, 13, 14, 28 
see also dictionaries 
overlap-add, 35-36, 39-40, 44, 45, 67, 
73-78, 113-114, 117, 124, 135- 
137, 147 
pitch-synchronous, 26, 145 
overlap-add property, 35-36, 50, 110, 113 
see also perfect reconstruction win- 
dows 
oversampled DFT, 35, 51-62, 68, 71, 148, 
204 
oversampled filter banks, 16, 47, 48, 95- 
98, 125 


packet-lossy networks, 204 

parallel methods, 169 

parameter interpolation, 28, 51, 64, 66- 
67, 73-80, 109, 205 

parametric models, 4, 8-9, 17, 27-29, 47- 
51, 62-63, 75-76, 81-88, 110, 
147-151, 160, 170-171, 179, 
200-204 

Parseval’s theorem, 123, 132, 134-138 

partial tracking, 44, 49, 204 

see also line tracking 


partials, 30 
peak picking, 51-62, 80, 135, 139, 147- 
148, 202, 205 


perceptual coding, 2 
see also masking, transparency 
perceptual entropy, 204 
perceptual losslessness, 5-6, 8, 115-116, 
120, 124, 135 
see also transparency 
perfect reconstruction, 4-6, 9, 31, 34-44, 
48, 50, 61, 89-91, 95, 97, 115, 
145, 148-153, 157-164, 170, 
177, 201-203, 209-213 
perfect reconstruction filter banks, 4, 12,
16, 41, 47, 89-92, 98, 125-126, 
152-155, 209-213 
perfect reconstruction windows, 35-36, 
41, 45-47, 73, 74 
periodic-plus-details models, 157-162
phase 
atomic, 193-194 
insensitivity, 66, 76, 130, 151, 218 
interpolation, 66-67, 73, 147, 149- 
150 
see also parameter interpolation 
matching, 62, 66, 74, 77-79, 113, 149 
modeling, 75-79 
short-time Fourier transform, 33, 42 
sinusoidal, 41, 58, 149 
phase vocoder, 12, 31, 48-49, 86 
multiresolution, 96, 97 
physical models, 1-4, 6, 9 
piano, 84, 116 
pitch detection, 7, 139-143, 147 
phase-locked, 140-142 
pitch-period upsampling, 157-158, 162, 
163 
pitch-shifting, 8, 26, 30, 81-83, 146, 152 
formant-corrected, 83, 146 
pitch-synchronous filtering, 146 
pitch-synchronous Fourier transforms, 148, 
165, 203, 218 
pitch-synchronous models, 27-28, 139- 
166, 170, 202-203
pitch-synchronous overlap-add, 26, 145 
pitch-synchronous representation, 139- 
152, 161, 163, 166, 215 
pitch-synchronous segmentation, 140, 142- 
147, 157 
pitch-synchronous sinusoidal models, 27, 
56, 145, 147-152, 202, 215 
line tracking, 148 
peak picking, 56, 148 
synthesis, 148-151 
pitch-synchronous wavelet transforms, 27, 
145, 152-164, 202 
pitched, 140, 143 
polyphase transforms, 158-160 
power spectral density, 123-124 
pre-echo, 5, 47, 96 
atomic model, 180-181, 186-188 
short-time Fourier transform, 43 
sinusoidal model, 50, 51, 79-81, 88, 
96, 109-110, 112, 114, 202, 204 
wavelet transform, 162-164 
see also attack problem 
precision effects, 75, 162, 163 
prototype waveform coding, 9, 139 
prototype window, 38-42, 47, 50, 51, 54, 
187, 203 
PSD, see power spectral density
pseudo-inverse, 15-16, 19, 59, 169-170, 
207 
pseudo-periodic signals, 27, 30, 43, 56, 
139-166, 202, 203 
PSOLA, see pitch-synchronous overlap- 
add 
PSR, see pitch-synchronous representa- 
tion 
psychoacoustics, 84, 164, 165 
auditory models, 93, 96, 118-119 
critical bands, 118-119 
equivalent rectangular bands, 118- 
125, 129-132 
masking, 5, 6, 8, 116, 118, 125, 205 
noise perception, 117-126, 130, 134 
phase insensitivity, 66, 76, 130, 151, 
218 
pyramids, 95-98, 206 


quadratic time-frequency representations, 
25-26 
cross terms, 25-26, 207 
quantization, 7, 12, 43, 47, 98, 108, 204 
see also vector quantization 
quantization noise, 4, 46-47, 95


raised cosine window, 128 
random dictionaries, 179 
rate-distortion, 6, 16, 61, 102, 107, 161, 
169, 179, 203, 204 
see also computation-rate-distortion 
readability, 25, 207 
readmission, 175-178 
real decompositions of real signals, 174, 
193-195, 216-217 
reallocation, 26 
reassignment, 26, 207 
reconstruction artifacts, 12, 23, 29, 43, 
46-47, 50, 67, 76-81, 84-87, 
108, 117, 160, 181, 186, 204 
see also pre-echo 
reconstruction-plus-residual models, 30-
31, 115, 135, 162 
rectangular window, 35, 47, 51-58, 148, 
183-184 
recursive filter banks, 28, 167, 182, 187- 
200, 203 
redundancy, 14, 16, 47, 61, 67, 68, 101, 
104, 147, 151, 160, 165, 168, 
196, 197, 205 
refinement, 206-207 
see also successive refinement 
repetition, 146 
resampling, 143-146, 148, 152 
residual, 1 
sinusoidal model, 30-31, 79-82, 85, 
100, 110, 112, 115-118, 202 


see also decompositions 
residual modeling, 5, 27, 30-31, 80, 84, 
85, 115-138, 162, 202, 205 
analysis, 119-124, 130, 132-134, 
136, 138 
analysis-synthesis, 115-138 
normalization, 130, 131, 135-138 
synthesis, 119-125, 130-132, 135- 
138 
resolution, 16, 22-23, 25-26, 52, 79, 87-
88, 96
frequency, 22, 44, 46, 51-57, 69-71,
97, 109, 129, 148, 162
time, 12, 22, 66, 93, 97, 99, 113, 162
tradeoffs, 22-26, 56, 71, 88, 91-93,
100, 109, 114, 129, 167, 191
see also localization, multiresolution
resolution of harmonics, 56-57, 148
ripple, 127-129 


sample rate conversion, 143 
see also resampling 
saxophone, 80, 81, 100, 110-112, 116, 118, 
131, 132 
scale, 21 
segmentation 
adaptive, 88, 100-114 
dynamic, 88, 101-110, 117, 118, 124, 
202, 204 
fixed, 108-112, 118, 202 
forward, 110-113 
heuristic, 110-113 
multiresolution, 85, 88, 100 
sequential methods, 62, 101, 169 
short-time energy, 48, 117, 119, 120, 135 
short-time Fourier transform, 2, 12, 27, 
31-57, 64, 75-76, 84, 86, 130- 
133, 192, 201-202 
interpretations, 33-34, 94, 192 
filter bank, 36-41
time-localized spectra, 34-36 
see also FBS, OLA 
modification, 47-48, 75, 81, 82 
tiling, 25, 34, 54 
time reference, 33, 180, 193 
signal adaptivity, 4, 9, 13-17, 24, 27-28, 
44, 84-86, 94, 100, 102, 114, 
138, 147, 151, 163-167, 170- 
171, 200-203, 206 
signal analysis, 8, 169 
signal estimation, 8, 161 
signal modeling, 1 
signal-adaptive filter banks, 24, 44, 100, 
114, 164 
signal-plus-residual models, 5, 30-31, 115, 
135 


singular value decomposition, 15-16, 19- 
20, 28, 169-170, 207 
sinusoidal atoms, 85-88, 109, 203, 206
sinusoidal modeling, 9, 13, 27, 29-84, 116,
139, 142, 162, 166, 201-202,
204-205, 218
analysis, 51-63
line tracking, 64-66, 75-80, 109, 139,
147-148, 205
multiresolution, 80, 84-114, 116-
118, 124, 138, 167, 170, 202-
205
parameter interpolation, 51, 64, 66-
67, 73-80, 109, 205 
peak picking, 51-62, 80, 139, 147- 
148, 202, 205 
pitch-synchronous, 27, 56, 145, 147- 
152, 202, 215 
residual, 30-31, 79-82, 85, 100, 110, 
112, 115-118, 202 
synthesis 
frequency-domain, 67-79, 125 
time-domain, 63-67 
zero-phase, 149-151 
smoothing window, 183-184 
source separation, 97, 165 
source-filter models, 2, 9, 82, 83, 115, 142, 
147, 200 
sparse approximate models, 17, 28, 169-
172
see also compact models 
spectral estimation, 63, 92, 117, 130-131, 
200 
spectral modification, 83, 160-161 
spectral sampling, 45, 67-72, 83, 148 
see also oversampled DFT 
spectrogram, 84 
spectrum, 216 
start windows, 114 
STFT, see short-time Fourier transform 
stop windows, 114 
structured audio, 9 
subband coding, 164 
suboptimality, 99, 107-111, 170, 173, 
177-178, 204, 207 
subspace projection, 173-175, 177, 179 
subspace pursuit, 173-178, 194-195 
successive refinement, 3, 26, 28, 95, 153, 
154, 169, 170, 179, 186, 204 
SVD, see singular value decomposition 
synthesis, 1 
synthesis window, 35 


thresholding, 6, 7, 12, 16, 170 

tight frames, 16, 61, 132, 134, 204 

tiles, 22-25, 152-153, 156 
non-rectangular, 22 
tilings, 24, 34, 54, 94, 152, 153, 156
timbre, 84
timbre space, 83-84
time, 21 
see also localization, resolution 
time reference 
atomic models, 180 
short-time Fourier transform, 33, 
180, 193 
sinusoidal models, 58 
time support, 12, 32, 87-88, 97-98, 109, 
182, 199 
time-domain aliasing, 35, 40, 44-47, 68, 
135 
time-frequency 
atoms, see atoms 
decompositions, 21-27 
dictionaries, 13, 27, 167, 173, 179- 
186, 203 
plane, 22-27, 153, 192, 207
resolution, see resolution 
tiles, 22-25, 152-153, 156 
tilings, 24, 34, 54, 94, 152, 153, 156 
time-limited interpolation, 35, 54 
time-scaling, 8, 26, 30, 81-83, 97, 146, 152 
time-varying filters, 83 
time-varying windows, 113-114 
tracks, 30, 50, 65, 73, 78, 80, 148, 152, 204 
transients, 12, 21, 43, 65, 67, 79-80, 85- 
88, 97, 107, 108, 114, 117, 143, 
151, 162-167, 180-182, 203 
transition regions, 127-129 
transition windows, 114 
transparency, 5, 13, 116, 118, 124, 135, 
138, 151, 164 
see also perceptual losslessness 
tree-structured filter banks, 12, 16, 91-93, 
97, 153 
triangular window, 35, 74, 77, 113-114, 
137 
truncation depth, 102 
two-channel filter banks, 12, 89-93, 209- 
213 
two-dimensional transforms 
discrete cosine transform, 165 
wavelet transform, 160, 165 
two-norm, 15, 16, 169, 171-174, 185 


uncertainty principle, 22, 47 

undercomplete dictionaries, 172 

underdetermined systems, 168 

undersampling, 35, 68 

unpitched, 140, 143 

unvoiced, 115, 142 

upsampled wavelets, 153-157 

upsampling, 39, 82, 94, 144, 153-157 
pitch-period, 157-158, 162, 163 
validity, 48

variance, 121-123 

vector quantization, 171, 206 

vibrato, 143 

Viterbi algorithm, 65, 101-102 
vocoder, 48 

voiced, 115, 142 

voiced-unvoiced models, 115, 142-143 


wavelet packets, 12-13, 16, 24, 91-92, 96-


97, 101, 107, 169, 180, 213 
tiling, 25 


wavelet transform, 2, 12-14, 19, 22-24, 


88-97, 114, 139, 152-155, 160-
164, 166, 209-213 

continuous-time, 23, 88 

filter banks, 11, 12, 88-94, 98, 164, 
209-213 

multiplexed, 157-159, 161 

pitch-synchronous, 27, 145, 152- 
164, 202 

tiling, 25

two-dimensional, 160, 165 


wavelets, 23, 88-94, 152-153 


basis, 168, 180 

biorthogonality, 11, 12, 91, 211-213 
comb, 157 

Daubechies, 93, 94, 153 

Haar, 17-19, 155, 157 

IIR, 182 

upsampled, 153-157 


Wigner-Ville distribution, 25 


window method, 127-129 
window switching, 100, 114, 203 
windows 


analysis, 31, 57 

asymmetric, 113-114 

Blackman-Harris, 35, 69, 75 

design, 35-36, 45-47, 57, 69-73, 
185-186 

Gaussian, 32, 54, 185, 186 

Hamming, 35, 75 

Hanning, 35, 41-43, 53-58, 69, 70, 
75, 185-187 

hybrid, 74-75, 113, 114, 137 

overlap-add property, 35-36, 110, 
113 

perfect reconstruction, 35-36, 41, 
45-47, 73, 74 

prototype, 38-42, 47, 50, 51, 54, 187, 
203 

raised cosine, 128 

rectangular, 35, 47, 51-58, 148, 183- 
184 

smoothing, 183-184 

start, 114 

stop, 114 

synthesis, 35 

time-varying, 113-114 

transition, 114 

triangular, 35, 74, 77, 113-114, 137 


zero padding, 53-54, 61, 145, 192 
zero-phase sinusoidal modeling, 149-151 


